ORGANIZATION OF THE HEARSAY II SPEECH UNDERSTANDING SYSTEM


Victor R. Lesser, Richard D. Fennell, Lee D. Erman, and D. Raj Reddy

Computer  Science  Department* 

Carnegie-Mellon  University 
Pittsburgh,  Pa.  15213 

ABSTRACT 


Hearsay II (HSII) is a system currently under development at
Carnegie-Mellon University to study the connected speech
understanding problem. It is similar to Hearsay I (HSI) in that it is
based on the hypothesize-and-test paradigm, using cooperating
independent knowledge sources communicating with each other
through a global data structure (blackboard). It differs in the
sense that many of the current limitations and shortcomings of HSI
are resolved in HSII.

The main new features of the Hearsay II system structure
are: 1) the representation of knowledge as self-activating,
asynchronous, parallel processes, 2) the representation of the
partial analysis in a generalized 3-dimensional network (the
dimensions being level of representation (e.g., acoustic, phonetic,
phonemic, lexical, syntactic), time, and alternatives) with contextual
and structural support connections explicitly specified, 3) a
convenient modular structure for incorporating new knowledge
into the system at any level, and 4) a system structure suitable
for execution on a parallel processing system.

The main task domain under study is the retrieval of daily
wire-service news stories upon voice request by the user. The
main parametric representations used for this study are 1/3
octave filter-bank and LPC-derived vocal tract parameters
(Knudsen, 1974, and Kriz, 1974). The acoustic segmentation and
labeling procedures are parameter-independent (Goldberg, et al.,
1974). The acoustic, phonetic, and phonological components
(Shockey and Erman, 1974) are feature-based rewriting rules
which transform the segmental units into higher-level phonetic
units. The vocabulary size for the task is approximately 1200
words. This vocabulary information is used to generate word-level
hypotheses from phonetic and surface-phonemic levels based
on prosodic (stress) information. The syntax for the task permits
simple English-like sentences and is used to generate hypotheses
based on the probability of occurrence of that grammatical
construct (Rich, 1974). The semantic model is based on the news
items of the day, analysis of the conversation, and the presence of
certain content words in the partial analysis. This knowledge is to
be represented as a production system. The system is expected
to be operational on a 16-processor mini-computer system (Bell,
et al., 1971) being built at CMU.

This  paper  deals  primarily  with  the  issues  of  the  system 
organization  of  the  Hearsay  II  system. 


INTRODUCTION 

The Hearsay II (HSII) speech understanding system is a
successor to the Hearsay I (HSI) system (Reddy, et al., 1973a,
1973b). HSII represents, in terms of both its system organization
and its speech knowledge, a significant increase in sophistication
and generality over HSI. The development of HSII has been based
on two years of experience with a running version of HSI, a desire
to exploit multiprocessor and network computer architectures for
efficient implementation (Bell, et al., 1971, 1973, and Erman, et al.,
1973), and a desire to handle more complex speech task domains
(e.g., larger vocabularies, less restricted grammars, and a more
complete set of knowledge sources including prosodics, user
models, etc.). While from a conceptual point of view HSII is a
natural extension of the framework that HSI posited for a speech
understanding system, it differs significantly in its design and in its
details of implementation.


The  HEARSAY  System  Model 


There  are  four  dimensions  along  which  knowledge 
representation  in  HSI  can  be  described: 

1)  function, 

2)  structure, 

3)  cooperation, 

4)  attention  focusing. 

The function of a knowledge source (KS) in HSI has three
aspects. The first is for the KS to know when it has something
useful to contribute, the second is to contribute its knowledge
through the mechanism of making a hypothesis (guess) about some
aspect of the speech utterance, and the third is to evaluate the
contribution of other knowledge sources, i.e., to verify and reorder
(or reject) the hypotheses made by other knowledge sources.
Each of these aspects of a KS is carried out with respect to a
particular context, the context being some subset of the
previously generated hypotheses. Thus, new knowledge is built
upon the educated guesses made at some previous time by other
knowledge sources.


HSI was based on the view that the inherently errorful
nature of connected speech processing could be handled only
through the efficient use of multiple, diverse sources of knowledge
(Reddy, et al., 1970, and Newell, et al., 1971). The major focus of
the design of HSI was the development of a framework for
representing these diverse sources of knowledge and their
cooperation (Reddy and Newell, 1974). This framework is the
conceptual legacy which forms the basis for the HSII design.


The structure of each knowledge source in HSI is specified
so that it is independent and separable from all other KS's in the
system. This permits the easy addition of new types of KS's and
replacement of KS's with alternative versions of those KS's. Thus,
the system structure can be easily adapted to new speech task
domains which have KS's specific to that domain, and the
contribution of a particular KS to the total recognition effort can
be more easily evaluated.


* This  research  was  supported  in  part  by  the  Advanced  Research  Projects  Agency  of 
the  Department  of  Defense  under  contract  no.  F44620-73-C-0074  and  monitored  by 
the  Air  Force  Office  of  Scientific  Research. 
The choice of a framework for cooperation among
knowledge sources is intimately interwoven with the function and
structure of knowledge in HSI. The mechanism for KS cooperation
involves hypothesizing and testing (creating and evaluating)
hypotheses in a global data base (blackboard). The generation and
modification of globally accessible hypotheses thus becomes the
primary means of communication between diverse KS's. This
mechanism of cooperation allows a KS to contribute knowledge
without being aware of which other KS's will use its knowledge or
which KS contributed the knowledge that it used. Thus, each KS
can be made independent and separable.

The global data base that KS's use for cooperation contains
many possible interpretations of the speech data. Each of these
interpretations represents a "limited" context in which a KS can
possibly contribute information by proposing or validating
hypotheses. Attention focusing of a KS involves choosing which of
these limited contexts it will operate in and for how much
processing time. The attention focusing strategy is decoupled
from the functions of individual knowledge sources. Thus, the
decision of whether a KS can contribute in a particular context is
local to the KS, while the assignment of that KS to one of the many
contexts on which it can possibly operate is made more globally.
This decoupling of focusing strategy from knowledge acquisition,
together with the decoupling of the data environment (global data
base) from control flow (KS invocation) and the limited context in
which a KS operates, permits a quick refocusing of attention of
KS's. The ability to refocus quickly is very important in a speech
understanding system because the errorful nature of speech data
and processing leads to many potential interpretations of the
speech. Thus, as soon as possible after an interpretation no
longer seems the most promising, the activity of the system should
be refocused to the new most promising interpretation.


OVERVIEW  OF  HEARSAY  I 

The following is a brief description of the HSI
implementation for this model of knowledge source representation
and cooperation. (A more complete description is contained in
Reddy, et al., 1973a, 1973b.) This description will then be used to
contrast the differences of implementation philosophy between HSI
and HSII.


HEARSAY  I Implementation  Overview 

The global data base of HSI consists of partial sentence
hypotheses, each being a sequence of words with non-overlapping
time locations in the utterance. It is a partial sentence hypothesis
because not all of the utterance may be described by the given
sequence of words. In particular, gaps in the knowledge of the
utterance are designated by "filler" words. The partial sentence
hypotheses also contain confidence ratings for each word
hypothesis and a composite rating for the overall sequence of
words. A sentence hypothesis is the focal point that is used to
invoke a knowledge source. The sentence hypothesis also
contains the accumulation of all information that any knowledge
source has contributed to that hypothesis.

Knowledge sources are invoked in a lockstep sequence
consisting of three phases: poll, hypothesize, and test. At each
phase, all knowledge sources are invoked for that phase, and the
next phase does not commence until all KS's have completed the
current one. The poll phase involves determining which KS's have
something to contribute to the sentence hypothesis which is
currently being focused upon; polling also determines how
confident each KS is about its proposed contributions. The
hypothesize phase consists of invoking the KS showing the most
confidence about its proposed contribution of knowledge. This KS
then hypothesizes a set of possible words (option words) for some
(one) "filler" word in the speech utterance. The testing phase
consists of each KS evaluating (verifying) the possible option
words with respect to the given context. After all knowledge
sources have completed their verifications, the option words which
seem most likely, based on the combined ratings of all the KS's,
are then used to construct new partial sentence hypotheses. The
global data base is then re-evaluated to find the most promising
sentence hypothesis; this hypothesis then becomes the focal point
for the next hypothesize-and-test cycle.
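The poll, hypothesize, and test phases described above can be sketched as a single function. This is only an illustrative sketch, not the HSI implementation: the names LockstepKS, lockstep_cycle, and FILLER are hypothetical, ratings are assumed to be plain numbers, and "combined rating" is assumed to mean a simple sum.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

FILLER = "<filler>"  # hypothetical marker for a gap in the sentence hypothesis


@dataclass
class LockstepKS:
    """A knowledge source with one callable per phase (illustrative only)."""
    name: str
    poll: Callable[[List[str]], float]              # confidence in contributing
    hypothesize: Callable[[List[str]], List[str]]   # option words for one filler
    test: Callable[[List[str], str], float]         # rating of one option word


def lockstep_cycle(sentence: List[str],
                   sources: List[LockstepKS]) -> List[Tuple[float, List[str]]]:
    """Run one poll/hypothesize/test cycle on a partial sentence hypothesis."""
    if FILLER not in sentence:
        return []
    slot = sentence.index(FILLER)
    # Poll phase: every KS reports its confidence; the most confident is chosen.
    best = max(sources, key=lambda ks: ks.poll(sentence))
    # Hypothesize phase: only the chosen KS proposes option words.
    options = best.hypothesize(sentence)
    # Test phase: every KS rates every option word (summing is an assumption).
    rated = [(sum(ks.test(sentence, w) for ks in sources), w) for w in options]
    rated.sort(key=lambda pair: pair[0], reverse=True)
    # Build new partial sentence hypotheses, most promising first.
    return [(score, sentence[:slot] + [w] + sentence[slot + 1:])
            for score, w in rated]


# Toy knowledge sources with made-up ratings.
acoustic = LockstepKS("acoustic",
                      poll=lambda s: 0.4,
                      hypothesize=lambda s: ["night", "knight"],
                      test=lambda s, w: 0.5 if w == "night" else 0.4)
syntax = LockstepKS("syntax",
                    poll=lambda s: 0.9,
                    hypothesize=lambda s: ["night", "light"],
                    test=lambda s, w: {"night": 0.8, "light": 0.3}.get(w, 0.0))

new_hypotheses = lockstep_cycle(["day", "and", FILLER], [acoustic, syntax])
```

Each phase here runs to completion before the next begins, which is the lockstep property the following section criticizes.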

Problems with HEARSAY I

There are four major design decisions in the HSI
implementation of knowledge representation and cooperation
which prevent HSI from being applied to more complex speech
tasks or multiprocessor environments.

The first, and most important, of these limiting decisions
concerns the use of the hypothesize-and-test paradigm. As
implemented in HSI, the paradigm is exploited only at the word
level. The implication of hypothesizing and testing at only the
word level is that the knowledge representation is uniform only
with respect to cooperation at that level. That is, the information
content of any element in the global data base is limited to a
description at the word level. The addition of non-word level KS's
(i.e., KS's cooperating via either sub-word levels, such as syllables
or phones, or via super-word levels, such as phrases or concepts)
thus becomes cumbersome because this knowledge must somehow
be related to hypothesizing and testing at the word level. This
approach to non-word level KS's makes it difficult to add non-word
knowledge and to evaluate the contribution of this
knowledge. In addition, the inability to share non-word level
information among KS's causes such information to be recomputed
by each KS that needs it.

Secondly, HSI constrains the hypothesize-and-test paradigm
to operate in a lockstep control sequence. The effect of this
decision is to limit parallelism, because the time required to
complete a hypothesize-and-test cycle is the maximum time
required by any single hypothesizer KS plus the maximum time
required by any single verifier (testing) KS. Another disadvantage
of this control scheme is that it increases the time it takes the
system to refocus attention, because there is no provision for any
communication of partial results among KS's. Thus, for example, a
rejection of a particular option word by a KS will not be noticed
until all the KS's have tested all the option words.
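As a small worked illustration of the timing argument, the cycle time under lockstep control can be computed from per-KS phase times. The function names and the timing figures are invented for the example, and one processor per KS is assumed.

```python
from typing import List, Tuple


def lockstep_cycle_time(hypothesize_times: List[float],
                        test_times: List[float]) -> float:
    """Each phase is a barrier, so it lasts as long as its slowest KS."""
    return max(hypothesize_times) + max(test_times)


def phase_stats(times: List[float]) -> Tuple[float, float]:
    """Return (phase duration, fraction of processor time spent idle),
    assuming one processor per KS within the phase."""
    duration = max(times)
    idle = 1 - sum(times) / (duration * len(times))
    return duration, idle


# Made-up timings (arbitrary units) for three knowledge sources.
hyp_times = [2, 3, 10]
test_times = [1, 1, 8]

cycle = lockstep_cycle_time(hyp_times, test_times)   # 10 + 8
hyp_duration, hyp_idle = phase_stats(hyp_times)      # half the capacity is idle
```

With these figures the two fast knowledge sources wait on the slow one in each phase, so adding processors would not shorten the cycle.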

The third weakness in the HSI implementation concerns the
structure of the global data base: there is no provision for
specifying relationships among alternative sentence hypotheses.
The absence of relational structures among hypotheses has the
effect of increasing the overall computation time and increasing
the time to refocus attention, because the information gained by
working on one hypothesis cannot be shared by propagating it to
other relevant hypotheses.

The fourth limiting design decision relates to how a global
problem-solving strategy (policy) is implemented in HSI: policy
decisions, such as those involving attention focusing, are
centralized (in a "Recognition Overlord"), and there is no coherent
structure for the policy algorithms. The effect of having no
explicit system structure for implementing policy decisions makes
it very awkward to add or delete policy algorithms and
difficult to analyze the effectiveness of a policy and its interaction
with other policies.

OVERVIEW OF HEARSAY II

Experience with HSI (as described above) has led to several
important observations about a more general, uniform, and natural
structure for representing and operating on the (dynamic) state of
the utterance recognition:*

1. The internal structure of hypotheses at different levels of
knowledge representation may be essentially the same,
except for the primitive unit of information held in an
hypothesis. This structural homogeneity in the global data
base allows the actions of hypothesizing and testing at
these various levels to be treated in a uniform manner.

2. The different types of knowledge (and their relationships)
present in speech may be naturally represented in a
single, uniform data structure. This data structure is
3-dimensional: one dimension represents information levels
(e.g., phrasal, lexical, phonetic), the second represents
speech time, and the third dimension contains alternative
(competing) hypotheses at a particular level and time.
These three dimensions form a convenient addressing
structure for locating hypotheses.

3. There is a conceptually simple and uniform way of
dynamically relating hypotheses at one level of knowledge
to alternative hypotheses at that level and to hypotheses
at other knowledge levels in the structure. The resulting
structure is an AND/OR graph with modifications which
provide for temporal relationships and selective
dependency relationships.**
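The 3-dimensional addressing structure of observation 2 can be pictured with a small sketch in which a level and a time span select the set of competing alternatives. The class names, the integer time units, and the overlap rule are assumptions made for illustration; they are not taken from the HSII implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hypothesis:
    level: str      # information level, e.g. "lexical"
    start: int      # begin time in arbitrary units (assumed)
    end: int        # end time, exclusive (assumed)
    name: str       # the hypothesized element, e.g. a word
    rating: float = 0.0


class Blackboard:
    """Hypotheses addressed by (level, time span); the hits are the alternatives."""

    def __init__(self) -> None:
        self._by_level: Dict[str, List[Hypothesis]] = {}

    def add(self, hyp: Hypothesis) -> None:
        self._by_level.setdefault(hyp.level, []).append(hyp)

    def alternatives(self, level: str, start: int, end: int) -> List[Hypothesis]:
        """Return hypotheses at `level` overlapping [start, end), best first."""
        hits = [h for h in self._by_level.get(level, [])
                if h.start < end and start < h.end]
        return sorted(hits, key=lambda h: h.rating, reverse=True)


bb = Blackboard()
bb.add(Hypothesis("lexical", 0, 30, "would", rating=0.6))
bb.add(Hypothesis("lexical", 0, 30, "will", rating=0.7))
bb.add(Hypothesis("lexical", 30, 50, "you", rating=0.8))

competing = [h.name for h in bb.alternatives("lexical", 0, 30)]
```

Here the first two coordinates (level and time) act as the address, and the third dimension is simply the list returned for that address.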

System  Structure 

The main goal of the HSII design is to extend the concepts
developed in HSI for the representation and cooperation of
knowledge at the word level to all levels of knowledge needed in a
speech understanding system, based on the preceding
observations.

The generalization of the hypothesize-and-test paradigm to
all levels of speech knowledge implies the need for a mechanism
for transferring information among levels. This mechanism is
already embodied in the hypothesize-and-test paradigm; that is,
one can characterize two types of hypothesization a knowledge
source might be called upon to perform: horizontal and vertical
hypothesization. A hypothesization is horizontal when a KS uses
contextual information at a given knowledge level to predict new
hypotheses at the same level (e.g., the hypothesization that the
word "night" might follow the sequence of words "day" -
"and"); whereas a hypothesization is vertical when a KS uses
information at one level in the data base to predict new
hypotheses at a different level (e.g., the generation of a
hypothesis that a [T] occurred when a segment of silence is
followed by a segment of aspiration).

* The meaning of these observations will be made more clear by
the further descriptions below.

** This latter feature refers to "connection matrices" and is
described below in more detail.

The HSII implementation of the hypothesize-and-test
paradigm has also resulted in a generalization of the lockstep
control scheme for KS sequencing employed by HSI. HSII relaxes
the constraints on the hypothesize-and-test paradigm and allows
the knowledge-source processes to run in an asynchronous,
data-directed manner. A knowledge source is instantiated as a
knowledge-source process whenever the data base exhibits
characteristics which satisfy a "precondition" of the knowledge
source. A precondition of a KS is a description of some partial
state of the data base which defines when and where the KS can
contribute its knowledge by modifying the data base. Such a
modification might be adding new hypotheses proposed by the KS
(at the information level appropriate for that KS) or verifying
(criticizing) hypotheses which already exist. The modifications
made by any given knowledge-source process are expected to
trigger further knowledge sources by creating new conditions in
the data base to which those knowledge sources respond. The
structure of a hypothesis has been so designed as to allow the
preconditions of most KS's to be sensitive to a single, simple
change in some hypothesis (such as the changing of a rating or the
creation of a structural link). Through this data-directed
interpretation of the hypothesize-and-test paradigm, HSII
knowledge sources exhibit a high degree of asynchrony and
potential parallelism. A side-effect of this more general control
scheme for HSII is that the strategy need not be centralized and
implemented as a monolithic overlord, but rather can be
implemented as policy modules which operate in precisely the same
manner as KS's.
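Precondition-triggered activation can be sketched as a loop that hands each data base change to every knowledge source whose precondition it satisfies. This is a sequential simulation under stated assumptions, not the HSII scheduler: the Change record, the KnowledgeSource fields, and the toy knowledge sources are hypothetical, and real knowledge-source processes would run asynchronously rather than in a single queue.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Change:
    kind: str    # e.g. "new-hypothesis" or "rating-changed" (assumed labels)
    level: str   # information level where the change occurred
    name: str    # name of the affected hypothesis


@dataclass
class KnowledgeSource:
    name: str
    precondition: Callable[[Change], bool]      # when and where the KS applies
    action: Callable[[Change], List[Change]]    # changes the KS makes in turn


def run(sources: List[KnowledgeSource], initial: List[Change],
        max_steps: int = 100) -> List[str]:
    """Process changes until none remain; return a trace of KS activations."""
    pending: Deque[Change] = deque(initial)
    trace: List[str] = []
    steps = 0
    while pending and steps < max_steps:
        change = pending.popleft()
        for ks in sources:
            if ks.precondition(change):
                trace.append(f"{ks.name} <- {change.level}:{change.name}")
                # The KS's own modifications may trigger further KS's.
                pending.extend(ks.action(change))
        steps += 1
    return trace


# Toy vertical hypothesizers; the proposed elements are invented.
phone_ks = KnowledgeSource(
    "phone-hypothesizer",
    precondition=lambda c: c.kind == "new-hypothesis" and c.level == "segmental",
    action=lambda c: [Change("new-hypothesis", "phonetic", "T")])
word_ks = KnowledgeSource(
    "word-hypothesizer",
    precondition=lambda c: c.kind == "new-hypothesis" and c.level == "phonetic",
    action=lambda c: [Change("new-hypothesis", "lexical", "tea")])

trace = run([word_ks, phone_ks],
            [Change("new-hypothesis", "segmental", "silence+aspiration")])
```

No knowledge source names another; the order of activation falls out of the changes themselves, and a policy module could be added as one more entry in the list.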

The 3-dimensional data base, augmented by the AND/OR
structural relationships specified over that data base, permits
information generated by one knowledge source to be: 1)
retained for use by other KS's, and 2) quickly propagated to
other relevant parts of the data base. This retention and
propagation provide two important features for solving a complex
problem in which errors are highly likely. First, quick refocusing
can occur when a particular path no longer appears promising.
Second, "selective" backtracking may be used; i.e., when a KS finds
that it has made an incorrect decision, it does not have to eliminate
all information generated since that decision, but rather only that
subset which depends on the incorrect decision. In this way,
information generated by one knowledge source is retained and is
usable by itself and other KS's in other relevant contexts.
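Selective backtracking amounts to invalidating only the transitive dependents of the incorrect decision. The sketch below assumes, as a simplification, that every recorded dependency is a necessary support (as with a SEQUENCE relationship); the function name and the toy dependency table are hypothetical.

```python
from typing import Dict, Set


def retract(depends_on: Dict[str, Set[str]], wrong: str) -> Set[str]:
    """Return `wrong` plus every item that transitively depends on it."""
    invalid = {wrong}
    changed = True
    while changed:
        changed = False
        for item, supports in depends_on.items():
            # An item becomes invalid once any of its supports is invalid.
            if item not in invalid and supports & invalid:
                invalid.add(item)
                changed = True
    return invalid


# Toy dependency table: each hypothesis lists the hypotheses it was built on.
depends_on = {
    "will": {"W", "IH", "L"},
    "would": {"W", "UH", "D"},
    "you": {"J", "AX"},
    "question": {"will", "you"},
}

discarded = retract(depends_on, "IH")
retained = set(depends_on) - discarded
```

Only 'will' and the phrase built on it are discarded; 'would' and 'you', although generated later than the bad decision might have been, remain available.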

Summarizing, HSII is based on the views: 1) that the state
of the recognition can be represented in a uniform, multilevel data
base, and 2) that speech knowledge can be characterized in a
natural manner by describing many small knowledge sources.
These knowledge sources react to certain states of the data base
(via their preconditions) and, once instantiated as
knowledge-source processes, provide their own changes to the data base
which contribute to the progress of the recognition. The
hypothesize-and-test paradigm, when stated in sufficiently
non-restrictive (parallel) terms, serves to describe the general
interactions among these knowledge sources. In particular,
changes made by one or more knowledge-source processes may
trigger other knowledge sources to react to these changes by
validating (testing) them or hypothesizing further changes. The
intent of HSII is to provide a framework within which to explore
various configurations of information levels, knowledge sources,
and global strategies.*
From a more general point of view, the goal of HSII is to
provide a multiprocess-oriented software architecture to serve as
a basis for systems of cooperating (but independent and
asynchronous) data-directed knowledge-source processes. The
purpose of such a structure is to achieve effective parallel search
over a general artificial intelligence problem-solving graph,
employing the hypothesize-and-test paradigm to generate the
search graph and using a uniform, interconnected, multilevel global
data base as the primary means of interprocess communication.

HEARSAY II SYSTEM DESIGN AND IMPLEMENTATION

One can derive from the description of the desired HSII
recognition process given above several basic components of the
required system structure. First, a sufficiently general structured
global data base is needed, through which the knowledge sources
may communicate by inserting hypotheses and by inspecting and
modifying the hypotheses placed there by other knowledge
sources. Second, some means for describing the various
knowledge sources and their internal processing capabilities is
required. Third, in order to have knowledge sources activated in a
data-directed manner, a method is required by which a set of
preconditions may be specified and associated with each
knowledge source. Fourth, in order to detect the satisfaction of
these preconditions and in order to allow knowledge sources to
locate parts of the data base in which they are interested, two
mechanisms are needed: 1) a monitoring mechanism to record
where in the data base changes have occurred and the nature of
those changes, and 2) an associative retrieval mechanism for
accessing parts of the data base which conform to particular
patterns which are specified as matching-prototypes.
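The two mechanisms named in the fourth point can be sketched together: a store that logs every change it undergoes, and a retrieval call that matches nodes against a prototype. Representing a matching-prototype as a set of required field values is an assumption made for this sketch, as are all class and method names.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class Node:
    level: str
    name: str
    start: int
    end: int
    rating: float


class MonitoredDataBase:
    """Toy data base with change monitoring and prototype-based retrieval."""

    def __init__(self) -> None:
        self.nodes: List[Node] = []
        self._changes: List[Tuple[str, Node]] = []

    def insert(self, node: Node) -> None:
        self.nodes.append(node)
        self._changes.append(("created", node))    # record where and what

    def set_rating(self, node: Node, rating: float) -> None:
        node.rating = rating
        self._changes.append(("rating", node))

    def drain_changes(self) -> List[Tuple[str, Node]]:
        """Hand the recorded changes to a precondition checker and reset the log."""
        out, self._changes = self._changes, []
        return out

    def retrieve(self, **prototype: Any) -> List[Node]:
        """Return nodes whose fields equal every value given in the prototype."""
        return [n for n in self.nodes
                if all(getattr(n, k) == v for k, v in prototype.items())]


db = MonitoredDataBase()
will = Node("lexical", "will", 0, 30, 0.5)
db.insert(will)
db.insert(Node("lexical", "would", 0, 30, 0.4))
db.insert(Node("phonetic", "T", 0, 5, 0.3))
db.set_rating(will, 0.7)

matches = [n.name for n in db.retrieve(level="lexical", start=0)]
pending = [kind for kind, _ in db.drain_changes()]
```

A precondition evaluator would consume the drained changes, while a running knowledge source would use retrieval to find the neighbouring hypotheses it needs.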

Elements  of  the  System  Structure 

The following sections outline the HSII implementation of the
various basic system components.

Global Data Base. The design of HSII is centered around a
global data base (blackboard) which is accessible to all
knowledge-source processes. The global data base is structured as a uniform,
multilevel, interconnected data structure.

Each level in the data base contains a (potentially complete)
representation of the utterance; the levels are differentiated by
the units that make up the representation, e.g., phrases, words,
phonemes. The system structure of HSII does not pre-specify
what the levels in the global data structure are to be. A particular
configuration, called HSII-C0 (Configuration Zero), is being
implemented as the first test of the HSII structure. Figure 1
shows a schematic of the levels of HSII-C0. A more detailed
description and justification for this particular configuration can be
found in Shockey and Erman (1974). This configuration will be
used as the basis for examples to illustrate various aspects of the
HSII system.

Parametric Level - The parametric level holds the most basic
representation of the utterance that the system has; it is the
only direct input to the machine about the acoustic signal.


* It is interesting to note that this generalized form of
hypothesize-and-test leads to a system organization with some
characteristics similar to QA4 (Rulifson, et al., 1973) and
PLANNER (Hewitt, 1972). In particular, there are strong
similarities in the data-directed sequencing of processes.
Figure 1. The Levels in HSII-C0


Several different sets of parameters are being used in HSII-C0
interchangeably: 1/3-octave filter-band energies
measured every 10 msec., LPC-derived vocal-tract
parameters (Knudsen, 1974), and wide-band energies and
zero-crossing counts.

Segmental Level - This level represents the utterance as
labeled acoustic segments. Although the set of labels may be
phonetic-like, the level is not intended to be phonetic -- the
segmentation and labeling reflect acoustic manifestation and
do not, for example, attempt to compensate for the context of
the segments or attempt to combine acoustically dissimilar
segments into (phonetic) units.

As with all levels, any particular portion of the utterance may
be represented by more than one competing hypothesis (i.e.,
multiple segmentations and labelings may co-exist).

Phonetic Level - At this level, the utterance is represented by a
phonetic description. This is a broad phonetic description in
that the size (duration) of the units is on the order of the
size of phonemes; it is a fine phonetic description to the
extent that each element is labeled with a fairly detailed
allophonic classification (e.g., "stressed, nasalized [I]").

Surface-Phonemic Level - This level, named by seemingly
contradicting terms, represents the utterance by phoneme-like
units, with the addition of modifiers such as stress and
boundary (word, morpheme, syllable) markings.

Syllable  Level  - The  unit  of  representation  here  is  the  syllable. 

Lexical Level - The unit of information at this level is the word.
(Note again that at any level competing representations can
be accommodated.)

Phrasal Level - Syntactic elements appear at this level. In fact,
since a level may contain arbitrarily many "sub-levels" of
elements structured as a modified AND/OR graph, traditional
kinds of syntactic trees can be directly represented here.

Conceptual Level - The units at this level are "concepts". As
with the phrasal level, it may be appropriate to use the graph
structure of the data base to indicate relationships among
different concepts.

The basic unit in the data structure is a node; a node
represents the hypothesis that a particular element exists in the
utterance. For example, an hypothesis at the phonetic level may
be labeled as 'T'. Besides containing the hypothesis element name,
a node holds several other kinds of information, including: 1) a
correlation with a particular time period in the speech utterance,*
2) scheduling parameters (validity ratings, attention focus factors,
measures of computing effort expended, etc.), and 3) connection
information which relates the node to other nodes in the data base.
Structural relationships between nodes (hypotheses) are
represented through the use of links; links provide a means for
specifying contextual abstractions about the relationships of
hypotheses. A link is an element in the data structure which
associates two nodes as an ordered pair; one of the nodes is
termed the upper hypothesis and the other is called the lower
hypothesis. The lower hypothesis is said to support the upper
hypothesis; the upper hypothesis is called a use of the lower one.
There are several types of links; in general, if a node serves as
the upper hypothesis for more than one link, all of those links
must be of the same type. Thus, one can talk of the "type of the
hypothesis," which is the same as the type of all of its lower links.


Figure 2. Examples of SEQUENCE and OPTION Relationships.


The two most important structural relationship types are
SEQUENCE and OPTION.

A SEQUENCE node is an hypothesis that is supported by a
(timewise) sequential set of hypotheses at a lower level (or
sublevel -- see below). For example, Figure 2a shows an
hypothesis of 'will' at the lexical level supported by the
(time-)ordered contiguous sequence of 'W', 'IH', and 'L' at
the surface-phonemic level. The interpretation of a
SEQUENCE node is that all of the lower hypotheses must
be valid in order to support the upper hypothesis.

An OPTION node is an hypothesis that has alternative
supports from two or more hypotheses, each of which
covers essentially the same time period. For example,
Figure 2b shows the hypothesis of 'noun' at the phrasal
level as being supported by any of 'boy', 'toy', or 'lie', all
of which are competing word hypotheses covering
(approximately) the same time area.**
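The difference between the two relationship types can be made concrete with a short sketch in which a SEQUENCE node needs all of its lower hypotheses and an OPTION node needs at least one. Reducing validity to a boolean is a simplification introduced here (the text speaks of ratings), and the Node class and is_supported function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    level: str
    kind: str = "LEAF"            # "LEAF", "SEQUENCE", or "OPTION"
    valid: bool = True            # consulted only for leaf nodes
    lower: List["Node"] = field(default_factory=list)  # supporting hypotheses


def is_supported(node: Node) -> bool:
    """SEQUENCE needs every lower hypothesis; OPTION needs at least one."""
    if node.kind == "SEQUENCE":
        return all(is_supported(n) for n in node.lower)
    if node.kind == "OPTION":
        return any(is_supported(n) for n in node.lower)
    return node.valid


# Figure 2a: 'will' supported by the sequence 'W', 'IH', 'L'.
w, ih, l = (Node(p, "surface-phonemic") for p in ("W", "IH", "L"))
will = Node("will", "lexical", kind="SEQUENCE", lower=[w, ih, l])

# Figure 2b: 'noun' supported by any of 'boy', 'toy', 'lie'.
boy, toy, lie = (Node(x, "lexical") for x in ("boy", "toy", "lie"))
noun = Node("noun", "phrasal", kind="OPTION", lower=[boy, toy, lie])

ih.valid = False    # one missing phoneme breaks the SEQUENCE
boy.valid = False   # one rejected alternative leaves the OPTION standing
```

Because all lower links of a node share one type, storing the type on the upper node, as done here, matches the "type of the hypothesis" convention described above.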

Figure 3 is an example of a larger fragment of the global
data structure. The level of an hypothesis is indicated by its
vertical position; the names of the levels are given on the left.


* This time period is specified with an explicit estimation of
fuzziness, even to the extent of spanning the entire utterance.

** In addition to SEQUENCE and OPTION, there are two kinds of
structural relationships which are generalizations of them: An
AND node is similar to a SEQUENCE node except that there is no
implication of any time sequentiality among the supports -- they
may overlap or be disjoint. An OR node is similar to an OPTION
node in that the supports represent alternatives, but (as with
the AND node) there is no time requirement.
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Figure  3.  An  Example  ot  the  Dala  Structure 

Time  location  is  approximately  indicated  by  horizontal  placement, 
but  duration  s only  very  roughly  indicated  (eg.,  the  boxes 
surrounding  the  two  hypotheses  at  the  phrasal  level  should  be 
much  wider).  A'fernatives  are  indicated  by  proximity;  for  example, 
'will'  and  'woulo'  are  word  hypotheses  covering  the  same  time 
span.  Likewise,  'question'  and  'modal-question',  ’youl’  and  'yOu2', 
and  'J'  and  'Y*  all  represent  pairs  ot  alternatives. 

This example illustrates several features of the data
structure:

The hypothesis 'you', at the lexical level, has two alternative
phonemic "spellings" indicated; the hypotheses labeled
'you1' and 'you2' are nodes created, also at the lexical
level, to hold those alternatives. In general, such
sublevels may be created arbitrarily.

The link between 'you1' and 'D' is a special kind of SEQUENCE
link (indicated here by a dashed line) called a CONTEXT link;
a CONTEXT link indicates that the lower hypothesis
supports the upper one and is contiguous to its brother
links, but it is not "part of" the upper hypothesis in the
sense that it is not within the time interval of the upper
hypothesis -- rather, it supplies a context for its
brother(s). In this case, one may "read" the structure as
stating "'you1' is composed of 'J' followed by 'AX' (schwa)
in the context of the preceding 'D'." (This reflects the
phonological rule that "would you" is often spoken as
"would-ja".) Thus, a CONTEXT link allows important
contextual relationships to be represented without
violating the implied time assumptions about SEQUENCE
nodes.

The phonemic spelling of the word "you" held by 'you1'
includes a contextual constraint (as just described); the
'you2' option does not have this constraint. However,
'you1' and 'you2' are such similar hypotheses that there is
strong reason for wanting to retain them as alternative
options under 'you' (as indicated in Figure 3), rather than
representing them unconnectedly. In general, the problem
is that the use of an hypothesis implies certain contextual
assumptions about its environment, while the support of an
hypothesis may itself be predicated on a particular set of
contextual assumptions. A mechanism, called a connection
matrix, exists in the data structure to represent this kind of
relationship by specifying, for an OPTION hypothesis, which
of its alternative supports are applicable to ("connected to")
which of its uses. In this example, the connection matrix of
'you' (symbolized in Figure 3 by the 2-dimensional binary
matrix in the node) specifies that support 'you1' is relevant
to use 'question' (but not to 'modal-question') and that
support 'you2' is relevant to both uses. The use of a
connection matrix allows the efforts of preceding KS
decisions to be accumulated for future use and modification
without necessitating contextual duplication of parts of the
data base. Thus, much of the duplication of effort due to
the backtracking mode of HSI is avoided in HSII.
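
As an illustration only, the connection matrix of 'you' might be
held as a set of (support, use) pairs. The class and method names
below are hypothetical and are not part of HSII; the sketch shows
just the lookup that the matrix makes possible.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class ConnectionMatrix:
    supports: List[str]                      # rows: alternative supports
    uses: List[str]                          # columns: upper hypotheses
    connected: Set[Tuple[str, str]] = field(default_factory=set)

    def connect(self, support: str, use: str) -> None:
        if support not in self.supports or use not in self.uses:
            raise KeyError((support, use))
        self.connected.add((support, use))

    def relevant_uses(self, support: str) -> List[str]:
        return [u for u in self.uses if (support, u) in self.connected]

    def as_rows(self) -> List[List[int]]:
        """Binary matrix, one row per support and one column per use."""
        return [[int((s, u) in self.connected) for u in self.uses]
                for s in self.supports]


# The matrix described for 'you' in Figure 3.
you = ConnectionMatrix(supports=["you1", "you2"],
                       uses=["question", "modal-question"])
you.connect("you1", "question")
you.connect("you2", "question")
you.connect("you2", "modal-question")
```

Here `you.as_rows()` gives the 2-dimensional binary matrix, and
`you.relevant_uses("you1")` answers locally which uses depend on
the context-constrained spelling.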

Besides showing structural relationships (i.e., that one unit is
composed of several other units), a link is a statement about the
degree to which one hypothesis implies (i.e., gives evidence for
the existence of) another hypothesis. The strength of the
implication is held as information on the link. The sense of the
implication may be negative; that is, a link may indicate that one
hypothesis is evidence for the invalidity of another. Finally, this
statement of implication is bi-directional; the existence of the
upper hypothesis may give credence to the existence of the lower
hypothesis and vice versa.

The nature of the implications represented by the links
provides a uniform basis for propagating changes made in one part
of the data structure to other relevant parts without necessarily
requiring the intervention of particular knowledge sources at each
step. Considering the example of Figure 3, assume that the
validity of the hypothesis labeled 'J' is modified by some KS
(presumably operating at the phonetic level) and becomes very
low. One possible scenario for rippling this change through the
data base is:

First, the estimated validity of 'you1' is reduced, because 'J'
is a lower hypothesis of 'you1'.

This, in turn, may cause the rating of 'you' to be reduced.

The connection matrix at 'you' specifies that 'you1' is not
relevant to 'modal-question', so the latter hypothesis is not
affected by the change in rating of the former. Notice that
the existence of the connection matrix allows this decision
to be made locally in the data structure, without having to
search back down to the 'D' and 'J'.

'Question', however, is supported by 'you1' (through the
connection matrix at 'you'), so its rating is affected.

Further propagations can continue to occur, perhaps down
the other SEQUENCE links under 'question' and 'you1'.
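
The upward part of this scenario can be traced with a short sketch.
It is a simplification under stated assumptions: the two tables
below are a hand-copied, hypothetical fragment of Figure 3, only the
set of affected hypotheses is computed (not their new ratings), and
the downward propagations mentioned last are left out.

```python
from collections import deque

# Use-links: lower hypothesis -> upper hypotheses that use it.
USES = {
    "J": ["you1"],
    "you1": ["you"],
    "you2": ["you"],
    "you": ["question", "modal-question"],
}

# Connection matrix at 'you': support -> uses it is relevant to.
CONNECTIONS = {
    "you": {"you1": {"question"}, "you2": {"question", "modal-question"}},
}


def affected_by(changed):
    """List the hypotheses reached by rippling a rating change upward."""
    affected = []
    seen = {changed}
    queue = deque([(changed, None)])  # (node, support the change arrived through)
    while queue:
        node, via = queue.popleft()
        matrix = CONNECTIONS.get(node)
        for use in USES.get(node, []):
            if matrix is not None and via is not None and use not in matrix[via]:
                continue              # local decision: support not relevant to use
            if use not in seen:
                seen.add(use)
                affected.append(use)
                queue.append((use, node))
    return affected
```

Calling `affected_by("J")` visits 'you1', 'you', and 'question' but
never 'modal-question'; the decision is made from the matrix at
'you' alone, without looking back down at the phonemes.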

Knowledge Source Specification. A knowledge source is
specified in three parts: 1) the conditions under which it is to be
activated (in terms of the data base conditions in which it is
interested, as described in the section on "preconditions" below),
2) the kinds of changes it makes to the global data base, and 3) a
procedural statement (program) of the algorithm which
accomplishes those changes. A knowledge source is thus defined
as possessing some processing capability which is able to solve
some subproblem, given appropriate circumstances for its
activation.

The decomposition of the overall recognition task into
various knowledge sources is regarded as being a natural
decomposition. That is, the units of the decomposition represent
those pieces of knowledge which can be distinguished and
recognized as being somehow naturally independent.* Given a
sufficient set of such knowledge sources (that is, a set that
provides enough overall connectivity among the various levels of
the data base that a final recognition can be attained), the
collection is called the "overall recognition system." Such a scheme
of "inverse decomposition" (or, composition) seems very natural
for many problem-solving tasks, and it fits well into the
hypothesize-and-test approach to problem-solving. As long as a
sufficient "covering set" for the data base connections is
maintained, one can freely add new knowledge sources, or replace
or delete old ones. Each knowledge source is in some sense
self-contained, but each is expected to cooperate with the other
knowledge sources that happen to be present in the system at
that time.
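
The three-part specification and the freedom to add or delete
knowledge sources can be sketched as follows. This is a toy under
stated assumptions: the data base is reduced to a dictionary of
level names, and `KnowledgeSource`, `Registry`, and the example
functions are hypothetical names, not the HSII interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Blackboard = Dict[str, List[str]]  # level name -> hypothesis names (toy stand-in)


@dataclass
class KnowledgeSource:
    name: str
    precondition: Callable[[Blackboard], bool]  # part 1: when to activate
    output_levels: List[str]                    # part 2: where it makes changes
    action: Callable[[Blackboard], None]        # part 3: the procedure itself


class Registry:
    """Holds whichever knowledge sources are present at the moment."""

    def __init__(self) -> None:
        self._sources: Dict[str, KnowledgeSource] = {}

    def add(self, ks: KnowledgeSource) -> None:
        self._sources[ks.name] = ks

    def remove(self, name: str) -> None:
        del self._sources[name]

    def ready(self, blackboard: Blackboard) -> List[str]:
        return [ks.name for ks in self._sources.values()
                if ks.precondition(blackboard)]

    def run_ready(self, blackboard: Blackboard) -> None:
        for ks in list(self._sources.values()):
            if ks.precondition(blackboard):
                ks.action(blackboard)


def words_lack_spellings(bb: Blackboard) -> bool:
    return bool(bb["lexical"]) and not bb["surface-phonemic"]


def add_spellings(bb: Blackboard) -> None:
    bb["surface-phonemic"].extend(w + "/spelling" for w in bb["lexical"])


registry = Registry()
registry.add(KnowledgeSource("phoneme-hypothesizer", words_lack_spellings,
                             ["surface-phonemic"], add_spellings))

blackboard = {"lexical": ["will"], "surface-phonemic": []}
before = registry.ready(blackboard)   # the KS is interested
registry.run_ready(blackboard)
after = registry.ready(blackboard)    # its precondition no longer holds
```

Because each source carries its own activation condition, removing
one from the registry requires no change to the others.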


[Figure 4 diagram not reproduced. Levels, top to bottom: CONCEPTUAL,
PHRASAL, LEXICAL, SYLLABIC, SURFACE-PHONEMIC, PHONETIC, SEGMENTAL,
PARAMETRIC. Knowledge sources, drawn as arcs between levels:
Syntactic Word Hypothesizer, Phoneme Hypothesizer, Phone-Phoneme
Synchronizer, Phone Synthesizer, Segmenter-Classifier.]

Figure 4. The First Knowledge Sources for HSII-CO.


As examples of knowledge sources, Figure 4 shows the first
set of processes implemented for HSII-CO. The levels are
indicated as horizontal lines in the figure and are labeled at the
left. The knowledge sources are indicated by arcs connecting
levels; the starting point(s) of an arc indicates the level(s) of major
"input" for the KS, and the end point indicates the "output" level
where the knowledge source's major actions occur. In general, the
action of most of these particular knowledge sources is to create
links between hypotheses on its input level(s) and: 1) existing
hypotheses on its output level, if appropriate ones are already
there, or 2) hypotheses that it creates on its output level.

* The approach taken in knowledge source decomposition is not
an attempt to characterize somehow the overall recognition
process and then apply some sort of traffic flow analysis to its
internal workings in order to decompose the total process into
minimally interacting knowledge sources. Rather, knowledge
sources are defined by starting with some intuitive notion about
the various pieces of knowledge which could be incorporated in
a useful way to help achieve a solution.

The Segmenter-Classifier knowledge source uses the description
of the speech signal to produce a labeled acoustic
segmentation. (See Goldberg, et al. (1974) for a description
of the algorithm being used.) For any portion of the
utterance, several possible alternative segmentations and
labels may be produced.

The Phone Synthesizer uses labeled acoustic segments to
generate elements at the phonetic level. This procedure is
sometimes a fairly direct renaming of an hypothesis at the
segmental level, perhaps using the context of adjacent
segments. In other cases, phone synthesis requires the
combining of several segments (e.g., the generation of [t]
from a segment of silence followed by a segment of
aspiration) or the insertion of phones not indicated directly
by the segmentation (e.g., hypothesizing the existence of an
[l] if a vowel seems velarized and there is no [l] in the
neighborhood). This KS is triggered whenever a new
hypothesis is created at the segmental level.

The Syntactic Word Hypothesizer uses knowledge at the phrasal
level to predict possible new words at the lexical level which
are adjacent (left or right) to words previously generated at
the lexical level. (Rich (1974) contains a description of the
probabilistic syntax method being used here.) This
knowledge source is activated at the beginning of an
utterance recognition attempt and, subsequently, whenever a
new word is created at the lexical level.

The Phoneme Hypothesizer knowledge source is activated
whenever a word hypothesis is created (at the lexical level)
which is not yet supported by hypotheses at the
surface-phonemic level. Its action is to create one or more sequences
at the surface-phonemic level which represent alternative
pronunciations of the word. (These pronunciations are
currently pre-specified as entries in a dictionary.)

The Phone-Phoneme Synchronizer is triggered whenever an
hypothesis is created at either the phonetic or the
surface-phonemic level. This KS attempts to link up the new
hypothesis with hypotheses at the other level. This linking
may be many-to-one in either direction.

Figure 5 shows the initial HSII-CO knowledge sources of
Figure 4 augmented with four additional ones which are being
implemented or are planned.

The Semantic Word Hypothesizer uses semantic and pragmatic
information about the task (news retrieval, in this case) to
predict words at the lexical level.

The Word Candidate Generator uses phonetic information
(primarily just at stressed locations and other areas of high
phonetic reliability) to generate word hypotheses. This will
be accomplished in a two-stage process, with a stop at the
syllabic level, from which lexical retrieval is more effective.

The Phonological Rule Applier rewrites sequences at the
surface-phonemic level. This KS is used: 1) to augment the
dictionary lookup of the Phoneme Hypothesizer, and 2) to
handle word boundary conditions that can be predicted by
rule.


[Figure 5 diagram not reproduced.]

Figure 5. Augmented Knowledge Source Set for HSII-CO.


The primary duties of the Segment-Phone Synchronizer and the
Parameter-Segment Synchronizer are similar: to recover
from mistakes made by the (bottom-up) actions of the Phone
Synthesizer and Segmenter-Classifier, respectively, by
allowing feedback from the higher to the lower level.

The introduction of these knowledge sources indicates the
modularity and extendability of the system in terms of both
knowledge sources and levels. In particular, notice that even
though the purpose of some KS is stated as "correcting errors
produced by other knowledge sources," each KS is independent of
the others. Yet additional knowledge sources will be added to the
configuration as the need for them is seen and as ideas for their
implementation are developed.

Matching-Prototypes and Associative Retrieval. The
asynchronous processing activity in HSII is primarily data-directed;
this implies the need for some mechanism whereby one can
retrieve parts of the global data base in an associative manner.
HSII provides primitives for associatively searching the data base
for hypotheses satisfying specified conditions (e.g., finding all
hypotheses at the phonetic level which contain a vowel within a
certain time range). The search condition is specified by a
matching-prototype, which is a partial specification of the
components of a hypothesis. This partial specification permits a
component to be characterized by: 1) a set of desired values, or
2) a don't-care condition, or 3) values of components of a
hypothesis previously derived by matching another prototype. A
matching-prototype can be compared against a set of hypotheses.
Those hypotheses whose component values match those specified
by the matching-prototype are returned as the result of the
search. Associative retrieval of structural relationships among
hypotheses is also provided by several primitives. More complex
retrievals can be accomplished by combining the retrieval
primitives in appropriate ways.
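
A minimal sketch of such a search follows, assuming hypotheses are
plain dictionaries. The names `ANY`, `Ref`, `matches`, and
`retrieve` are hypothetical; they stand for the three ways a
component may be characterized, not for the actual HSII primitives.

```python
from dataclasses import dataclass

ANY = object()  # 2) the don't-care condition


@dataclass(frozen=True)
class Ref:
    """3) Take the wanted value from a previously matched hypothesis."""
    component: str


def matches(prototype, hypothesis, previous=None):
    for component, wanted in prototype.items():
        actual = hypothesis[component]
        if wanted is ANY:
            continue
        if isinstance(wanted, Ref):
            if previous is None or actual != previous[wanted.component]:
                return False
        elif actual not in wanted:  # 1) a set (or range) of desired values
            return False
    return True


def retrieve(prototype, hypotheses, previous=None):
    return [h for h in hypotheses if matches(prototype, h, previous)]


VOWELS = {"IH", "AX", "UH"}  # illustrative labels only
data_base = [
    {"name": "h1", "level": "phonetic", "label": "IH", "begin": 12},
    {"name": "h2", "level": "phonetic", "label": "L", "begin": 18},
    {"name": "h3", "level": "phonetic", "label": "AX", "begin": 40},
    {"name": "h4", "level": "lexical", "label": "will", "begin": 12},
]

# Phonetic-level vowels beginning within a given time range.
vowel_prototype = {"name": ANY, "level": {"phonetic"},
                   "label": VOWELS, "begin": range(10, 30)}
vowels = retrieve(vowel_prototype, data_base)

# Words that begin where the first matched vowel hypothesis begins.
word_prototype = {"level": {"lexical"}, "begin": Ref("begin")}
words = retrieve(word_prototype, data_base, previous=vowels[0])
```

The second retrieval shows how primitives combine: its prototype is
completed from the result of the first.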


Preconditions and Change Sets. Associated with every
knowledge source is a specification of the data base conditions
required for the instantiation of that knowledge source. Such
specifications, called preconditions, conceptually form an AND/OR
tree composed of matching-prototypes and structural relationships
which, when applied to the data base in an associative manner,
detect the regions of the data base in which the knowledge source
is interested (if the precondition is capable of being satisfied at
that time). Alternatively, one might think of the precondition
specification as a procedure, involving matching-prototypes and
structural relationships, which effectively evaluates a conceptual
AND/OR tree. This procedure may contain arbitrarily complex
decisions (based on current and past modifications to the data
base) resulting in the activation of desired knowledge sources
within the chosen contexts. The context corresponding to the
discovered data base region which satisfies some knowledge
source's precondition is used as an initial context in which to
instantiate that knowledge source as a new process. If there are
multiple regions in the data base that satisfy the specified
conditions, the knowledge source can be separately instantiated
for each context, or once with a list of all such contexts.

Whenever any KS performs a modification to the global data
base, the essence of the modification is preserved in a change set
appropriate to the change made (e.g., one change set records
rating changes, while another records time range changes).
Change sets serve to categorize data base modifications (events)
and are thus useful in precondition evaluation since they limit the
areas in the data base that need to be examined in detail. As
currently implemented, precondition evaluators exist as a set of
procedures which monitor changes in the data base (i.e., they
monitor additions to those change sets in which they are
interested). These precondition procedures are themselves
data-directed in that they are applied whenever sufficient changes have
been made in the global data base. In effect, the precondition
procedures themselves have preconditions, albeit of a much
simpler form than those possible for knowledge sources. For
example, a precondition procedure may specify that it should be
invoked (by a system precondition invoker) whenever changes
occur to two adjacent hypotheses at the word level or whenever
support is added to the phrasal level. By using the (coarse)
classifications afforded by change sets, the system avoids most
unnecessary data base examinations by the precondition
procedures. The major point to be made is that the scheme of
precondition evaluation is event-driven, being based on the
occurrence of changes in the global data base. In particular,
precondition evaluators are not involved in a form of busy waiting
in which they are constantly looking for something that is not yet
there.
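
The event-driven character of this scheme can be shown in a few
lines. The sketch assumes the simplest possible invoker, one that
calls an interested procedure on every recorded event rather than
after "sufficient" changes; the class name and the category strings
are hypothetical.

```python
from collections import defaultdict


class ChangeSets:
    """Records data base events by category and wakes interested procedures."""

    def __init__(self):
        self._events = defaultdict(list)    # category -> recorded events
        self._watchers = defaultdict(list)  # category -> precondition procedures

    def watch(self, category, procedure):
        self._watchers[category].append(procedure)

    def record(self, category, node, old, new):
        event = {"category": category, "node": node, "old": old, "new": new}
        self._events[category].append(event)
        for procedure in self._watchers[category]:
            procedure(event)                # runs only because this change happened

    def events(self, category):
        return list(self._events[category])


change_sets = ChangeSets()
invoked_for = []

# A precondition procedure interested only in rating changes.
change_sets.watch("rating", lambda event: invoked_for.append(event["node"]))

change_sets.record("rating", "you1", 80, 20)                # wakes the procedure
change_sets.record("time-range", "will", (0, 10), (0, 12))  # nobody is watching
```

No procedure polls the data base; the time-range change is recorded
but causes no examination at all.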

Once invoked, a precondition procedure uses sequences of
associative retrievals and structural matches on the data base in
an attempt to establish a context satisfying the preconditions of
one or more of "its" knowledge sources; any given precondition
procedure may be responsible for instantiating several (usually
related) knowledge sources. Notice that the data-directed nature
of precondition evaluation and knowledge-source instantiation is
linked closely to the primitive functions that are able to modify the
data base, for it is only at points of modification that a
precondition that was unsatisfied before may become satisfied.
Hence, data base modification routines have the responsibility
(although perhaps indirectly) of activating the precondition
evaluation mechanism.*



Multiprocessing  Considerations 

Some  Problems  and  Their  Solutions 

A primary goal in the design of HSII is to exploit the
potential parallelism of the Hearsay system model as fully as
possible. Several issues associated with the introduction of
parallel knowledge-source processes into HSII will be described
and their current solutions outlined.

Local Contexts. A precondition evaluator (process) is
invoked based on the occurrence of certain changes which have
taken place in the global data base since the last time the
evaluator was invoked; these changes, together with the state of
the relevant parts of the global data base at the instant at which
the precondition evaluator is invoked, form a local context within
which the evaluator operates. Conceptually, at the instant of its
invocation, the precondition evaluator takes a snapshot of the
global data base and saves the substance of the changes that have
occurred to that moment, thereby preserving its local context.

The necessity of preserving this local context exists for
several reasons: 1) HSII consists of asynchronous processes
sharing a common global data base which contains only the most
current data (that is, no history of data modification is preserved
in the global data base), 2) since the precondition evaluators are
also executed asynchronously, each evaluator is interested only in
changes in the data base which have occurred since the last time
that particular evaluator was executed (that is, the relevant set of
changes for a particular precondition evaluator is time-dependent
on the last execution of that evaluator), and 3) further
modifications to the global data base which are of interest to a
given knowledge-source process may occur between the time of
that knowledge-source process's instantiation and the time of its
completion (in particular, such modifications and their relationship
to data base values which existed at the time of the instantiation
of the knowledge-source process may influence the computation of
that knowledge-source process). Hence, the problem of creation
of local contexts exists, as do the associated problems of signalling
a knowledge-source process when its local context is no longer
valid and of updating these contexts as further changes occur in
the global data base.

* One might think of HSII as a production system (Newell, 1973)
which is executed asynchronously. The preconditions
correspond to the left-hand sides (conditions) of productions,
and the knowledge sources correspond to the right-hand sides
(actions) of the productions. Conceptually, these left-hand sides
are evaluated continuously. When a precondition is satisfied, an
instantiation of the corresponding right-hand side of its
production is created; this instantiation is executed at some
arbitrary subsequent time (perhaps subject to instantiation
scheduling constraints).

Consider the following time sequence of events:

T1: start precondition evaluator PRE
    (triggered by data base changes)

T2: PRE instantiates a knowledge-source process KS

T3: end PRE

T4: start KS

T5: after KS's revaluation of precondition
    <computation>

T6: KS modifies global data base
    <computation>

T7: KS modifies global data base again

T8: end KS

PRE is activated to respond to changes occurring in the
global data base. PRE should execute in the context of changes
existing at time T1 (since that context contains the changes which
caused PRE to be activated). KS is instantiated (readied for
running) at T2 due to further conditions PRE discovered about the
change context of T1. Hence, PRE should pass the context of T1
as the initial environment in which to run KS.

By time T4, when KS actually starts to execute, other
changes could have occurred in the global data base due to the
actions of other knowledge-source processes. So KS should
examine these new updating changes (those occurring between T1
and T4) and revalidate its precondition, if necessary. After
revaluation, KS assumes the updated context of T5, and it
proceeds to base its computation on the context of changes as of
T5.

When KS wishes to perform an actual update of elements of
the global data base at T6, it must examine the changes to the
global data base that occurred between T5 and T6 to see if any
other knowledge-source processes may have violated KS's
preconditions, thereby invalidating its computations. Having
performed this revalidation and any data base updating, KS should
update its context to reflect changes up to T6 for use in its
further computation. At T7, KS must look for further possible
invalidations to its most recent computations, due to possible
changes in the global data base by other knowledge-source
processes during the time period T6 to T7. When KS (which is an
instantiation of some knowledge source) completes its actions at
T8, its local context may be deleted.

Changes occurring between instantiations of a knowledge
source are accumulated in the local contexts of the precondition
evaluators and may become part of the local context of a future
instantiation of a knowledge source. Thus, the precondition
evaluators are always collecting data base changes (since these
evaluators are permanently instantiated), while individual
knowledge source instantiations accumulate data base changes
only during their transient existence.

In practice, to create local contexts one need only save the
contents of changes which occur in the global data base (thereby
allowing the global data base to contain only the very latest
values). In particular, no massive copy operations involving the
global data base are required. Thus, for each data base event
caused by a modification primitive, the associated change* is
distributed (copied) into the local contexts (which can now be
characterized as local change sets,* referring to the previous
discussion on change sets) of all knowledge-source processes and
precondition evaluators who care. Notice that not every
knowledge-source process and precondition evaluator cares about
every change that takes place. For example, a fricative detector
will not care about a data base change associated with the
grouping of several words to form a syntactic phrase, but it is
interested in the hypothesization of a word whose phonemic
spelling contains a fricative.

* The information which defines the change consists of the locus
of the change (i.e., a node name and a field name), the old value
of the field, the revised value of the field, the name of the
changer (the unique knowledge-source instantiation name), and
the system time of the change.

In order for a knowledge-source (or precondition evaluator)
process to receive changes which occur to particular fields of
particular nodes which are in its local context, those fields need to
be tagged with that knowledge-source instantiation's unique name.
Then, whenever a modification is done to the global data base, a
message signalling the change is sent to all who have tagged the
field being changed. In addition, the contents of the change are
also distributed to the local contexts of those knowledge sources
receiving the message. This data field tagging may be requested
by either a precondition evaluator which is about to instantiate a
knowledge source based on the values of particular fields (which
represent the context satisfying the precondition), or by a
knowledge-source process, once instantiated, which may request
additional fields to be tagged.

For example, in its search through the global data base for
conditions satisfying the preconditions of some knowledge source,
a precondition evaluator may accumulate references to the data
fields which it has examined; and when the entire precondition has
been found to be satisfied, the precondition evaluator tags those
fields (for which it has accumulated references) with the name of
the knowledge-source process it is about to instantiate. Similarly,
having been invoked, a knowledge-source process might wish to
do certain computations, but only if certain data fields are not
altered; the knowledge-source process itself can tag these fields
and thereby be informed of subsequent tampering with the tagged
fields. The union of these tagged fields forms a critical set
(specifying the fields of the local context for the
knowledge-source process) which is locked (see below) every time the
knowledge-source process wishes to modify the global data base.
Thus, after having locked the critical set and prior to performing
any update operations, the knowledge-source process can check
to see whether any other knowledge-source process has made
any changes which might invalidate the current knowledge
source's assumed context (and hence, perhaps, invalidate its
proposed update).** For example, if a knowledge-source process
is verifying a hypothesis in the data base because its rating
exceeds 50 (i.e., the rating value represents part of the local
context for the knowledge-source process), then before
performing any modifications on the data base which depend on
that assumption, the knowledge-source process should check to be
sure that no other knowledge-source process has invalidated the
rating assumption in the meantime.

* An alternative to replicating the change information for the
various knowledge-source processes is to maintain a single
central copy of those data, passing only references to the
centralized items to the various local contexts. The individual
change items may then be deleted from their central change
sets whenever there are no further outstanding references to
that change.

** The characterization of local contexts according to specific data
fields (which is made possible in part by the choice of levels in
the global data base) helps to minimize the overhead associated
with context maintenance. Also, the smaller the context, the
more flexible the scheduling strategy may be (since it needs to
be less concerned with the time required for a context swap on
a processor).
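
Field tagging, the distribution of changes, and the check made
before an update can be drawn together in one sketch. It is a
single-threaded simplification with hypothetical names throughout
(`GlobalDataBase`, `Change`, `KS-verify`, `KS-other`); locking is
deliberately left out, and the change record carries the five items
of information the text lists for a change.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Set, Tuple

Locus = Tuple[str, str]  # (node name, field name)


@dataclass
class Change:
    locus: Locus
    old: Any
    new: Any
    changer: str  # unique knowledge-source instantiation name
    time: int     # system time of the change


class GlobalDataBase:
    def __init__(self) -> None:
        self.values: Dict[Locus, Any] = {}              # latest values only
        self.tags: Dict[Locus, Set[str]] = {}           # who cares about a field
        self.local_contexts: Dict[str, List[Change]] = {}
        self.clock = 0

    def tag(self, locus: Locus, process: str) -> None:
        self.tags.setdefault(locus, set()).add(process)
        self.local_contexts.setdefault(process, [])

    def modify(self, locus: Locus, new: Any, changer: str) -> None:
        self.clock += 1
        change = Change(locus, self.values.get(locus), new, changer, self.clock)
        self.values[locus] = new
        for process in self.tags.get(locus, ()):
            if process != changer:                      # sent to the others who care
                self.local_contexts[process].append(change)

    def take_changes(self, process: str) -> List[Change]:
        changes = self.local_contexts[process]
        self.local_contexts[process] = []
        return changes


db = GlobalDataBase()
rating = ("you1", "rating")
db.values[rating] = 70

db.tag(rating, "KS-verify")          # KS-verify proceeds because rating > 50
db.modify(rating, 30, "KS-other")    # another process lowers the rating

changes = db.take_changes("KS-verify")
assumption_holds = all(c.new > 50 for c in changes)
```

Here `assumption_holds` comes out false, so the verifying process
would reconsider its proposed update instead of applying it.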

Data Base Deadlock Prevention. Any knowledge-source
process may request exclusive access to some collection of fields
at any time. Thus, unless some care is taken in ordering the
requests for such exclusive access, the possibility of getting into a
deadlock situation exists (where, for example, one
knowledge-source process is waiting for exclusive access to a field already
held exclusively by a second knowledge-source process, and vice
versa, resulting in neither process being able to proceed).
Applying a linear ordering to the set of lockable fields and
requesting exclusive access according to that ordering is a
commonly-used means of deadlock avoidance in resource
allocation. However, this technique works only if all the resources
(fields) to be locked are known ahead of time. The ability to tag a
data field allows a knowledge-source process to locate and
examine in arbitrary order the set of hypotheses that will form the
context for a data base modification and then, only after the entire
set of desired hypotheses (and links) has been determined, to lock
the desired set and perform the modification.

To eliminate the possibilities of deadlock, the actual locking
operation is relegated to a system primitive, and a knowledge-
source process is required to present to the locking primitive the
entire set of fields to which it wants exclusive access. This set is
then extended to include all fields in the critical set of the calling
knowledge-source process, for the reasons relating to context
reevaluation given above. The system locking primitive can then
request exclusive access for this union of data fields, on behalf of
the knowledge-source process, according to a universal ordering
scheme (such an ordering being possible since every node in the
global data base essentially has a unique serial number) which
assures that no deadlocks occur. Having been granted exclusive
access to all desired fields at once, the knowledge-source process
may first check to see whether there have been any changes to
the tagged data fields. If there have been none, it can proceed to
perform its modifications (which modifications are sent to the local
contexts of others who care about such things). However, if there
were changes, the knowledge-source process can, after examining
the changed data fields, decide whether it still wants to perform
the modifications. When the knowledge-source process is finished
updating the data base, it releases all its locked fields by
executing a system primitive unlocking operator. In particular, the
system ensures that a knowledge-source process will not request
two lock operations without issuing an intervening unlock
operation. In this manner, any possibility of a data base deadlock
is eliminated.
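
The locking discipline described above can be illustrated with a minimal Python sketch. It is not the HSII implementation: the names Field, LockingPrimitive, and changed_since, the per-field version counter used to detect changes to tagged fields, and the use of Python threads are all assumptions made for illustration. The sketch shows the three ingredients named in the text: one lock request per process at a time, extension of the request by the caller's critical (tagged) set, and acquisition in a single universal order by serial number.

```python
import threading

class Field:
    """A lockable data-base field with a unique serial number (assumed)."""
    def __init__(self, serial):
        self.serial = serial
        self.lock = threading.Lock()
        self.version = 0          # hypothetical: bumped on every modification

class LockingPrimitive:
    """Hypothetical system primitive that locks a whole set of fields at once."""
    def __init__(self):
        self._held = {}           # process id -> fields it currently holds

    def lock(self, pid, requested, critical_set):
        # Enforce "no two lock operations without an intervening unlock".
        if pid in self._held:
            raise RuntimeError("unlock required before a second lock request")
        # Union of the requested fields and the caller's critical set,
        # acquired in one universal order so no circular wait can arise.
        fields = sorted(set(requested) | set(critical_set),
                        key=lambda f: f.serial)
        for f in fields:
            f.lock.acquire()
        self._held[pid] = fields
        return fields

    def unlock(self, pid):
        for f in reversed(self._held.pop(pid)):
            f.lock.release()

def changed_since(tags):
    """tags maps Field -> version seen when the field was tagged."""
    return [f for f, seen in tags.items() if f.version != seen]

# Usage: tag a field while examining it, then lock, re-check, and modify.
a, b, c = Field(1), Field(2), Field(3)
primitive = LockingPrimitive()
tags = {b: b.version}
primitive.lock("ks1", requested=[c, a], critical_set=tags)
if not changed_since(tags):
    a.version += 1                # perform the modification
primitive.unlock("ks1")
```

Because every process acquires its fields in ascending serial order and holds no locks when it starts a new request, no cycle of waiting processes can form.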

Goal-Directed Scheduling. The computational sequence of a
typical HSII run may be characterized by considering the states of
the utterance recognition at any particular data base level. For
example, if one considers the efforts in producing the word level
and traces the development of the "best" partial sentences, the
processing that will have been done is analogous to a general
tree-search, where each node of the tree represents some
partially completed sentence (with the eventual resultant sentence
being one of the leaves of the tree). The problem now is to guide
this tree searching so as to find the answer leaf in an optimal way
(according to some measure of optimality). To achieve this end,
ratings are associated with every hypothesis and link in the global
data base (and thus with every partial sentence node of the
analogous search tree). Using these ratings (which are effectively
evaluation functions indicating the goodness of the results of the
work done so far with respect to a given partial sentence), one
may employ various tree searching strategies to advance the
search in some optimal manner.

To complicate matters further, however, HSII is intended to
be a multiprocess-oriented system. Therefore, schemes must be
devised for effectively searching a problem-solving graph using
parallel processes, since one can conceive of pursuing several
branches of the search graph in parallel by asynchronously
instantiating various knowledge-source processes to evaluate
various alternative paths. One must also take into consideration
the underlying hardware architecture. The physical placement of
the global data base and the knowledge-source processes will
have a very definite influence on the scheduling philosophy chosen
and the resultant system efficiency.

To aid in making scheduling decisions, one may associate
with every node in the global data base some attention focusing
factors, which are indicators telling how much effort has been
devoted to processing in this area of the search tree and how
desirable it is to devote further effort to this section of the tree.
Such attention focusing factors may also be associated with
various speech time regions to indicate interest in doing further
processing on certain regions of the utterance, regardless of any
particular information level. Furthermore, attention focusing
factors are associated with every data base modification, thereby
distributing attention focusing factors to the various change sets
which constitute the local contexts of the processes in the system.
The scheduler is one such process which might be especially
interested in such focusing factors, as will be described below.
The use of the various ratings and attention focusing factors
allows HSII to perform goal-directed scheduling, which is process
scheduling so as to achieve "optimal" recognition efficiency. The
thrust of goal-directed scheduling is that, while there may be
many processes ready to run and work on various parts of the
search tree, one should first schedule those processes which can
best help to achieve the goal of utterance recognition. Notice that
such search-tree pruning techniques as the alpha-beta procedure
(which is essentially a sequential algorithm anyway) do not apply
to HSII's non-game search trees, which do not have the constraint
that alternating levels of the tree represent the moves of an
opponent.

Goal-directed scheduling may be viewed as having two
separate functions: 1) using the ratings and attention focusing
factors associated with the global data base components to
schedule knowledge-source processes which have been invoked
(readied for execution) in response to previous events detected in
the global data base, and 2) using these same attention focusing
factors to detect important areas in the global data base which
require further work, and invoking precondition evaluators as soon
as possible to instantiate new knowledge-source processes to
work in those important areas. Thus, the attention focusing
factors within the global data base serve to schedule both
knowledge-source processes and precondition evaluators.

A knowledge-source process might be scheduled for
execution because it possesses the only processing capability
available to be applied to an important unexplored area of the
data base. However, if there are many such processes ready to
execute, the scheduler can perform a type of means-ends analysis
in which those knowledge-source processes are scheduled which
are likely to produce data base changes in which the system is
currently interested (such interest being noted by high attention
focusing factors in a given change set). For example, if the data
base contains focusing factors which highlight activity in a time
region in which there are no structural connections between two
adjoining levels, the scheduler would probably give a higher
priority to a knowledge-source process which will attempt (as
indicated in its external specifications) to make such a connection
than to a knowledge-source process which is likely merely to
perform a minor refinement on the ratings in one of the levels.

Another means of effecting goal-directed scheduling relates
to the attention focusing factors associated with various time
regions of the utterance (such focusing factors reaching across all
the information levels of the global data base). Using these time
region focusing factors, one can schedule knowledge-source
processes which can contribute in a particular time region, or
invoke precondition evaluators to instantiate some new
knowledge-source processes to work within the desired time
region.
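
As a rough illustration of goal-directed scheduling, the following Python sketch keeps ready knowledge-source processes in a priority queue ordered by a score built from a hypothesis rating, a node-level attention focusing factor, and a time-region focusing factor. The paper does not give a formula for combining these quantities; the weighted sum, the class names, and the process names below are assumptions chosen only to make the idea concrete.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class ReadyProcess:
    neg_priority: float                  # heapq pops the smallest, so negate
    name: str = field(compare=False)

def priority(rating, node_focus, region_focus, w=(1.0, 1.0, 1.0)):
    """Illustrative weighted sum; the source does not specify a formula."""
    return w[0] * rating + w[1] * node_focus + w[2] * region_focus

class GoalDirectedScheduler:
    def __init__(self):
        self._heap = []

    def ready(self, name, rating, node_focus, region_focus):
        """Register an invoked knowledge-source process as ready to run."""
        p = priority(rating, node_focus, region_focus)
        heapq.heappush(self._heap, ReadyProcess(-p, name))

    def next(self):
        """Return the most promising ready process, or None if none remain."""
        return heapq.heappop(self._heap).name if self._heap else None

# Usage: a process that would connect two levels in a highlighted time
# region outranks one that would only refine ratings.
scheduler = GoalDirectedScheduler()
scheduler.ready("refine-ratings", rating=40, node_focus=1, region_focus=2)
scheduler.ready("connect-levels", rating=40, node_focus=5, region_focus=9)
first = scheduler.next()                 # "connect-levels"
```

A real scheduler would also have to re-score waiting processes as focusing factors change, which this sketch does not attempt.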

SUMMARY,  CURRENT  STATUS,  AND  PLANS 

In summary, HSII is a system organization for speech
understanding that permits the representation of speech
knowledge in terms of a large number of diverse knowledge
sources which cooperate via a generalized (in terms of both data
and control) form of the hypothesize-and-test paradigm.
Knowledge sources are independent and separable; they are
activated in a data-directed manner and execute asynchronously,
communicating information among themselves through a global data
base. This global data base, which is a representation of the
partial analysis of the utterance, is a three-dimensional data
structure (in which the dimensions are level of representation,
time, and alternatives) augmented by AND/OR structural
relationships which interconnect elements of the data structure.
This global data base structure permits information generated by
one knowledge source to be: 1) retained for use by other
knowledge sources, and 2) quickly propagated to other relevant
parts of the data base. In addition to being a new
representational framework for specifying speech knowledge, HSII
is a system organization suitable for efficient implementation on a
multiprocessor computer system. In particular, the system
organization employs techniques which, while not violating the
independence and modularity of knowledge sources, permit: 1)
avoidance of deadlock in the data base, 2) efficient implementation
of data-directed sequencing of knowledge sources, and 3) goal-
directed scheduling of asynchronously executable knowledge-
source processes.

A preliminary, synchronous version of HSII has been
operating on CMU's PDP-10 since January, 1974. The fully
asynchronous, multiprocess version of HSII is now in the final
stages of being implemented, also on the PDP-10, and is expected
to be running by May, 1974. This multiprocess version of HSII will
also contain the capability of simulating the effect of operating
HSII in a multiprocessor environment. Experience with this
multiprocess version of HSII, together with simulation data on the
effects of operating in a multiprocessor environment, will form the
basis for a multiprocessor version of HSII on C.mmp. An initial
implementation of HSII on C.mmp is expected to be completed in
the first quarter of 1975.

ACKNOWLEDGMENTS 

We wish to acknowledge the contributions of the following
people: Allen Newell for his insights into the task of problem-
solving, David Parnas and Lee Cooprider for their discussions on
system design and decomposition, Gregory Gill for his untiring
efforts in system implementation, and Donald McCracken for his
valuable criticism of the ideas as presented in this paper.



BIBLIOGRAPHY 


Bell, C. G., W. Broadley, W. Wulf, A. Newell, et al. (1971), "C.mmp:
The CMU Multi-mini-processor Computer," Tech. Rep., Comp.
Sci. Dept., Carnegie-Mellon Univ., Pittsburgh, Pa.

Bell, C. G., R. C. Chen, S. H. Fuller, J. Grason, S. Rege, and
D. P. Siewiorek (1973), "The Architecture and Application
of Computer Modules: A Set of Components for Digital
Systems Design," COMPCON 73, San Francisco, Ca., pp. 177-
180.

Erman, L. D., R. D. Fennell, V. R. Lesser, and D. R. Reddy (1973),
"System Organizations for Speech Understanding:
Implications of Network and Multiprocessor Computer
Architectures for AI," Proc. 3rd Inter. Joint Conf. on
Artificial Intel., Stanford, Ca., pp. 194-199.

Goldberg, H. G., D. R. Reddy, and R. L. Suslick (1974), "Parameter-
Independent Machine Segmentation and Labeling," Proc.
IEEE Symp. Speech Recognition, Pittsburgh, Pa., pp. 106-
111, (this volume).

Hewitt, C. (1972), "Description and Theoretical Analysis (Using
Schemata) of Planner: A Language for Proving Theorems
and Manipulating Models in a Robot," AI Memo No. 251, MIT
Project MAC.

Knudsen, M. J. (1974), "Real-time Linear-predictive Coding of
Speech on the SPS-41 Microprogrammed Triple-processor
System," Proc. IEEE Symp. Speech Recognition, Pittsburgh,
Pa., pp. 274-277, (this volume).

Kriz, S. (1974), "A 16-bit A-D-A Conversion System for High-
Fidelity Audio Research," Proc. IEEE Symp. Speech
Recognition, Pittsburgh, Pa., pp. 278-282, (this volume).

Newell, A., J. Barnett, J. Forgie, C. Green, D. Klatt, J. C. R. Licklider,
J. Munson, R. Reddy, and W. Woods (1971), Speech
Understanding Systems: Final Report of a Study Group, pub.
by North-Holland (1973).

Newell, A. (1973), "Production Systems: Models of Control
Structures," in W. C. Chase (ed.) Visual Information
Processing, Academic Press, pp. 463-526.

Reddy, D. R., L. D. Erman, and R. B. Neely (1970), "The C-MU
Speech Recognition Project," Proc. IEEE System Sciences
and Cybernetics Conf., Pittsburgh, Pa.

Reddy, D. Raj, Lee D. Erman, and Richard B. Neely (1973a), "A
Model and a System for Machine Recognition of Speech,"
IEEE Trans. Audio and Electroacoustics, AU-21, 3, June, 1973,
pp. 229-238.

Reddy, D. R., L. D. Erman, R. D. Fennell, and R. B. Neely (1973b),
"The HEARSAY Speech Understanding System: An Example
of the Recognition Process," Proc. 3rd Inter. Joint Conf. on
Artificial Intel., Stanford, Ca., pp. 185-193.

Reddy, R., and A. Newell (1974), "Knowledge and its Representation
in a Speech Understanding System," in L. W. Gregg (ed.)
Knowledge and Cognition, Lawrence Erlbaum Assoc.,
Washington, D. C., chap. 10, (in press).

Rich, E. (1974), "Inference and Use of Simple Predictive Grammars,"
Proc. IEEE Symp. Speech Recognition, Pittsburgh, Pa.,
p. 242, (this volume).

Rulifson, J. F., et al. (1973), "QA4: A Procedural Calculus for
Intuitive Reasoning," Technical Note 73, Stanford Res. Inst.,
AI Center.

Shockey, L. and L. D. Erman (1974), "Sub-Lexical Levels in the
Hearsay II Speech Understanding System," Proc. IEEE Symp.
Speech Recognition, Pittsburgh, Pa., pp. 208-210, (this
volume).


April,  1974  (CMU) 




IEEE  Symp  Speech  Recognition 




THE DRAGON SYSTEM--AN OVERVIEW
James K. Baker

Computer Science Department
Carnegie-Mellon University
Pittsburgh, Pa.


ABSTRACT 


This paper briefly describes the major features of the
DRAGON speech understanding system. DRAGON makes
systematic use of a general abstract model to represent each of
the knowledge sources necessary for automatic recognition of
continuous speech. The model--that of a probabilistic function
of a Markov process--is very flexible and leads to features
which allow DRAGON to function despite high error rates from
individual knowledge sources. Repeated use of a simple
abstract model produces a system which is simple in structure,
but powerful in capabilities.


To achieve reliable speech recognition it is necessary
to combine information from a variety of sources([4]). In
addition to the direct acoustic information, valuable sources
include the vocabulary, the grammar, and the semantics of the
utterance. Extracting the information from each of these
sources of knowledge is a difficult task and the need to
coordinate the various pieces of information makes the task
even more difficult. For the DRAGON system a general
theoretical model has been adapted to represent each of the
important sources of knowledge. The sources of knowledge
are organized into a hierarchical system such that the
integrated system is again an instance of the same general
model. The availability of this general theoretical framework
has greatly simplified the DRAGON speech understanding
system.

The general model which is used throughout the
DRAGON system is that of a probabilistic function of a Markov
process([2]). In this model there are two sequences of random
variables X(1), X(2), X(3), ..., X(T), and Y(1), Y(2), Y(3), ..., Y(T).
The X's correspond to internal states which are not observed
and the Y's correspond to external observations whose
distributions depend on the values of the X's. For example,
the X's could represent the sequence of phones in an
utterance and the Y's could represent the sequence of acoustic
measurements. Alternatively, the X's could be the sequence of
words in an utterance and the Y's could represent the
sequence of phones and modifiers as the words are actually
pronounced. Changing our frame of reference again, the Y's
could represent the words of the utterance and the X's could
represent the underlying sequence of syntactic-semantic
states.


The major features of the DRAGON system are:

(1) Delayed decisions

(2) Generative form of model

(3) Hierarchical system

(4) Integrated representation

(5) General theoretical framework


The various sources of knowledge are organized into a
hierarchy of probabilistic functions of Markov processes. A
network is constructed to provide an integrated representation
of the hierarchy. Recognition of an utterance corresponds to
finding an optimum path through the network. The optimum
path is found by an algorithm which, in effect, explores all
possible paths in parallel([1]).


In terms of the network representation, most speech
recognition algorithms search for a suboptimum path through
the network. A globally optimum path would clearly be
superior, but with most models it is prohibitively expensive to
compute. The Markov model of the DRAGON system permits
such a globally optimum path to be found by an algorithm such
that the number of computations is linear in the length of the
utterance.

The Markov assumption is a prescription to include "all
relevant context" in formulating the state space of the process.
By considering at each point in time all possible internal states,
the algorithm searches all possible paths through the network.
By combining paths when and only when they come to the
same state at the same time, all decisions are delayed until the
full effect of all context, past and future, has been considered.
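
The idea of merging paths that reach the same state at the same time can be sketched as a standard dynamic-programming (Viterbi-style) search. The Python below is an illustration under assumptions, not DRAGON's actual code: the function name best_path, the use of state index 0 as the extra starting state, and the toy two-state model are all made up for the example. The tables a[i][j] and b[i][j][k] follow the transition and observation probabilities of the general model.

```python
def best_path(a, b, y, start=0):
    """a[i][j]: transition probability from state i to state j.
    b[i][j][k]: probability of observing k on the transition i -> j.
    Returns (probability, [x(1), ..., x(T)]) for the single best path."""
    n = len(a)
    # score[j] = probability of the best path that is in state j right now
    score = [1.0 if j == start else 0.0 for j in range(n)]
    back = []
    for obs in y:
        new, ptr = [0.0] * n, [0] * n
        for j in range(n):
            # All paths arriving in state j at this time are merged:
            # only the best one can be part of the global optimum.
            cands = [score[i] * a[i][j] * b[i][j][obs] for i in range(n)]
            ptr[j] = max(range(n), key=cands.__getitem__)
            new[j] = cands[ptr[j]]
        score = new
        back.append(ptr)
    # The final decision is made only after the whole utterance is seen.
    j = max(range(n), key=score.__getitem__)
    best, path = score[j], [j]
    for ptr in reversed(back[1:]):
        j = ptr[j]
        path.append(j)
    return best, path[::-1]

# Toy model (hypothetical numbers): observations depend only on the
# state being entered.
a = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.8, 0.2], [0.1, 0.9]]
b = [[emit[j] for j in range(2)] for i in range(2)]
prob, path = best_path(a, b, [0, 0, 1, 1])   # path == [0, 0, 1, 1]
```

The work per observation is fixed by the number of states, so the total cost grows linearly with the length of the observation sequence.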


Generative Form of Model

By having an external sequence (Y) depend
probabilistically on an unobserved internal sequence (X), the
system allows knowledge sources to be represented in a
generative form([6]). Given the sequence of syntactic-
semantic states one can generate the words. Given the words
one can generate the phones. Given the sequence of phones
one can generate the sequence of acoustic observations. But,
computationally, this hierarchy of conditional probabilities can
be reversed by applying Bayes' theorem. In analyzing a
specific utterance one can proceed from the known
observations to the internal states which must be inferred.
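
For reference, the reversal mentioned above is the usual statement of Bayes' theorem, written here in the PROB notation of this paper for a whole internal sequence x and a whole observed sequence y. The formula is standard; its placement here is an editorial illustration rather than an equation from the original text.

```latex
\mathrm{PROB}(X = x \mid Y = y)
  = \frac{\mathrm{PROB}(Y = y \mid X = x)\,\mathrm{PROB}(X = x)}
         {\sum_{x'} \mathrm{PROB}(Y = y \mid X = x')\,\mathrm{PROB}(X = x')}
```

The generative model supplies both factors in the numerator, and the denominator does not depend on x, so it can be ignored when only the most probable internal sequence is wanted.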


The sources of knowledge are organized into a
hierarchy based on the following observation: The "top" levels
of a speech recognition system change state less frequently
than the "bottom" levels. Thus a single syntactic-semantic
state corresponds to a sequence of several words; a single
word corresponds to a sequence of several phones; and a
phone corresponds to a sequence of several acoustic events.
The hierarchy is not absolute--for example, syntax and
semantics are mixed together into a single multi-level process
--but it provides a convenient means for combining the Markov
processes which represent the individual sources of
knowledge.

Integrated Representation

A network is constructed which represents the total
hierarchy of Markov processes. The process as a whole fits
the same general model as the pieces--it is a probabilistic
function of a Markov process. All of the "knowledge" of the
system is represented in a pair of simple data structures: the
transition matrix of the network and the table of conditional
probabilities connecting internal states to external
observations. The main program of the system is based on the
general model of a probabilistic function of a Markov process.
All speech-specific knowledge is represented in the data
structures, not in the program.

General Theoretical Framework

Having a general theoretical structure greatly simplifies
the speech recognition system. It is both easier to implement
and easier to understand. Its operations can be expressed
explicitly by a simple set of mathematical equations. A
powerful general system is constructed by repeated use of a
flexible theoretical model.

Potential Problems and Disadvantages

Delayed decisions--searching all possible paths through
the network--could lead to a combinatorial explosion in the
number of computations. The Markov model completely
prevents this combinatorial explosion. Alternate paths are
recombined at exactly the same rate that new branches are
formed. The total number of computations is linear in the
length of the utterance.

The integrated representation of a hierarchical system
could result in an excessively large state space. Care must be
exercised as to what context must be included and what can
be safely ignored. Experience indicates, however, that the
network representation is a compact and powerful
representation and speech recognition tasks with large
vocabularies can be accommodated.

Representing all knowledge as conditional probabilities
does not imply any loss of power, since the probabilities can
be set to zero or to one whenever appropriate. However, it
does require that estimates be computed for all the
probabilities in the system. Fortunately, all these probabilities
are easily estimated from the frequency of occurrence of
corresponding events in a set of training utterances.
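
A minimal sketch of frequency-based estimation, assuming the training utterances are already labeled with their state sequences (the paper does not describe the labeling or any smoothing; the function name and toy data are hypothetical):

```python
from collections import Counter, defaultdict

def estimate_transitions(state_sequences):
    """Relative-frequency estimate of transition probabilities a[i][j]
    from labeled training sequences. Unseen transitions are simply
    absent, which corresponds to a probability of zero."""
    counts = defaultdict(Counter)
    for seq in state_sequences:
        for i, j in zip(seq, seq[1:]):
            counts[i][j] += 1
    return {i: {j: c / sum(row.values()) for j, c in row.items()}
            for i, row in counts.items()}

# Toy training set: two short state sequences.
est = estimate_transitions([["a", "b", "b"], ["a", "a", "b"]])
# est["a"]["b"] is 2/3, est["a"]["a"] is 1/3, est["b"]["b"] is 1.0
```

The same counting scheme applies to the table of observation probabilities, with (state pair, observation) events in place of state pairs.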

General Model

Let the sequence X(1), X(2), X(3), ..., X(T) be the
sequence of states of a Markov process([3]) with transition
matrix $A = (a_{ij})$. Let Y(1), Y(2), Y(3), ..., Y(T) be a sequence
of random variables such that, for all t,
$\mathrm{PROB}(Y(t)=k \mid X(t-1)=i,\ X(t)=j) = b_{ijk}$. Use a bracket and
colon notation to abbreviate sequences. Thus
X[1:T] = X(1), X(2), X(3), ..., X(T) and
Y[1:T] = Y(1), Y(2), Y(3), ..., Y(T). The assumptions of the model are

$$
\begin{aligned}
\mathrm{PROB}(&Y(t)=y(t) \mid X[1:t]=x[1:t],\ Y[1:t-1]=y[1:t-1]) \\
 &= \mathrm{PROB}(Y(t)=y(t) \mid X(t-1)=x(t-1),\ X(t)=x(t)) \\
 &= b_{x(t-1),x(t),y(t)}
\end{aligned}
\tag{1}
$$

and

$$
\begin{aligned}
\mathrm{PROB}(&X(t)=x(t) \mid X[1:t-1]=x[1:t-1]) \\
 &= \mathrm{PROB}(X(t)=x(t) \mid X(t-1)=x(t-1)) \\
 &= a_{x(t-1),x(t)}
\end{aligned}
\tag{2}
$$

Under these assumptions,

$$
\mathrm{PROB}(X[1:T]=x[1:T],\ Y[1:T]=y[1:T])
 = \prod_{t=1}^{T} a_{x(t-1),x(t)}\, b_{x(t-1),x(t),y(t)}
\tag{3}
$$

where a special extra state x(0) is introduced and $a_{x(0),x(1)}$ and
$b_{x(0),x(1),y(1)}$ are defined appropriately.


It is convenient to introduce a special notation for the
total probability of all partial sequences resulting in a
particular state at a particular time. Let

$$
\begin{aligned}
\alpha(s,j) &= \mathrm{PROB}(X(s)=j,\ Y[1:s]=y[1:s]) \\
 &= \sum_{x[1:s]} \prod_{t=1}^{s} a_{x(t-1),x(t)}\, b_{x(t-1),x(t),y(t)}
\end{aligned}
\tag{4}
$$

where the sum is over all possible sequences x[1:s-1] and
x(s)=j. The values of $\alpha$ for a given s can easily be computed
from the values for s-1. In fact,

$$
\alpha(s,j) = \sum_{i} \alpha(s-1,i)\, a_{ij}\, b_{i,j,y(s)}
\tag{5}
$$

Conditional probabilities based on the known sequence
y[1:T] can be computed from the function $\alpha$ and a similar
function computed backwards in time from the end of the
sequence. For example,

$$
\begin{aligned}
\mathrm{PROB}(&X(T)=j \mid Y[1:T]=y[1:T]) \\
 &= \mathrm{PROB}(X(T)=j,\ Y[1:T]=y[1:T]) \,/\, \mathrm{PROB}(Y[1:T]=y[1:T]) \\
 &= \alpha(T,j) \,/\, \sum_{i} \alpha(T,i).
\end{aligned}
\tag{6}
$$

Each of the sources of knowledge needed for speech
recognition can be represented with this general Markov
framework.
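
The forward recursion and the final-state posterior described in this section can be sketched in a few lines of Python. This is an illustrative reading of the model, not code from DRAGON: the function names, the choice of state index 0 as the extra state x(0), and the toy probabilities are assumptions.

```python
def forward(a, b, y, start=0):
    """Return alpha, where alpha[s][j] = PROB(X(s)=j, Y[1:s]=y[1:s])
    for s = 0..T. alpha[0] puts all mass on the extra state x(0)=start.
    a[i][j] is the transition probability and b[i][j][k] the probability
    of observing k on the transition i -> j."""
    n = len(a)
    alpha = [[1.0 if j == start else 0.0 for j in range(n)]]
    for obs in y:
        prev = alpha[-1]
        # One step of the recursion: sum over all predecessor states i.
        alpha.append([sum(prev[i] * a[i][j] * b[i][j][obs] for i in range(n))
                      for j in range(n)])
    return alpha

def final_state_posterior(alpha):
    """PROB(X(T)=j | Y[1:T]=y[1:T]): normalize the last row of alpha."""
    total = sum(alpha[-1])
    return [v / total for v in alpha[-1]]

# Toy model (hypothetical numbers): observations depend only on the
# state being entered.
a = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.8, 0.2], [0.1, 0.9]]
b = [[emit[j] for j in range(2)] for i in range(2)]
alpha = forward(a, b, [0, 0, 1, 1])
posterior = final_state_posterior(alpha)
```

Each step touches every pair of states once, so the cost is proportional to T times the square of the number of states. For long utterances a practical implementation would rescale alpha or work with logarithms to avoid underflow, which this sketch omits.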

Representation of Knowledge Sources


Representing Acoustic-Phonetic Knowledge

There are several choices in how to represent
acoustic-phonetic knowledge. A decision must be made
whether acoustic observations should be preprocessed by
specialized procedures or whether a stochastic model should
deal directly with the acoustic parameters. To simplify the
exposition, consider just the case in which specialized
preprocessing is done.

Assume that at each time t ($1 \le t \le T$), an acoustic
observation is made. Each such observation consists of a
vector of values of a set of acoustic parameters, which in the
stochastic model is represented by a vector-valued random
variable Y(t). There is a sequence of phones P[1:J] which is
produced during the time interval $1 \le t \le T$. Assume that the
phones occupy disjoint segments of time; that is, assume there
is a sequence $s_0 < s_1 < s_2 < s_3 < \dots < s_J$ such that P(j) lasts from
observation $Y(s_{j-1})$ through observation $Y(s_j - 1)$. (Set $s_0 = 1$,
$s_J = T$.)

Let p[1:J] be the actual sequence of phones in an
utterance and let y[1:T] be the actual observed sequence of
acoustic parameters. For convenience, also introduce a special
initialization phone p(0) which is assigned a special value to
allow the initial probabilities to have the same form as the
transition probabilities later in the sequence. Since the actual
times $s_1, s_2, s_3, \dots, s_{J-1}$ are not known, it is necessary to associate
each arbitrary segment of time with some phone. For any pair
of times $t_1$ and $t_2$ let $S(t_1,t_2)$ be that value of j for which the
expression $(\mathrm{Min}(s_j,t_2) - \mathrm{Max}(s_{j-1},t_1))$ is maximized. If $t_1 < 1$ then
set $S(t_1,t_2) = 0$.
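
The mapping from a time segment to a phone index can be sketched as follows. The sketch assumes that the overlap between the segment and phone j is measured as min(s_j, t2) - max(s_{j-1}, t1), and that the zero case (the initialization phone) is triggered when the first time argument is less than 1; both are readings of the definition above rather than confirmed details. The function name and boundary values are hypothetical.

```python
def segment_phone_index(s, t1, t2):
    """s = [s_0, s_1, ..., s_J] holds the phone boundary times.
    Returns the j in 1..J whose interval overlaps the segment from t1
    to t2 the most, or 0 (the initialization phone) when t1 < 1.
    Ties are broken in favor of the earliest phone (an assumption)."""
    if t1 < 1:
        return 0
    J = len(s) - 1

    def overlap(j):
        return min(s[j], t2) - max(s[j - 1], t1)

    return max(range(1, J + 1), key=overlap)

# Three phones with boundaries at times 1, 5, 9, 14 (made-up values).
boundaries = [1, 5, 9, 14]
j = segment_phone_index(boundaries, 4, 9)    # overlaps phone 2 the most
```

In recognition the boundaries are not known, so a function like this is only useful for defining the model and for aligning labeled training data.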

The acoustic preprocessor tries to estimate a phonetic
transcription from the acoustics alone. By looking for
discontinuities or rapid changes in the acoustic parameters, the
preprocessor divides the sequence Y[1:T] up into K phone-like
segments $Y[1:t_1-1],\ Y[t_1:t_2-1],\ Y[t_2:t_3-1],\ \dots,\ Y[t_{K-1}:t_K]$. Then an
attempt is made to classify each segment $Y[t_{k-1}:t_k-1]$ using some
form of pattern recognition procedure. Let $t_0 < t_1 < t_2 < \dots < t_K$
be the segment boundary times as decided by the
preprocessor and introduce the random variable D(t) which is
1 if there exists a k such that $t_k = t$ and is 0 otherwise.
Let F(k) be the label assigned by the preprocessor to the
segment $Y[t_{k-1}:t_k-1]$. (For completeness, set $t_k = t_0 - 1$ for k<0,
and $t_k = t_K = T$ for k>K.)

For some pattern matching procedures it is possible to
directly estimate conditional probabilities. When using such a
procedure, let

$$
B[p,k] = \mathrm{PROB}(Y[t_{k-1}:t_k-1] = y[t_{k-1}:t_k-1] \mid P(S(t_{k-1},t_k)) = p).
\tag{7}
$$

The pattern matching procedure might yield only the label F(k)
representing a best guess as to the underlying phone. In such
a case it is necessary to estimate the conditional probabilities
from statistics of performance by the pattern matcher on
training data. Let f[1:K] represent the actual sequence of
labels generated by the pattern recognizer for the utterance
being considered. Then set

$$
B[p,k] = \mathrm{PROB}(F(k) = f(k) \mid P(S(t_{k-1},t_k)) = p),
\tag{8}
$$

where the conditional probability is estimated by the
frequency of such events in a set of training utterances.

In addition to estimating the probability of substitutions
or confusions, it is necessary to estimate the probability of the
preprocessor producing either too many or too few segments.
The probability of such events may be estimated from their
frequency of occurrence in a set of training utterances. Let

$$
\begin{aligned}
E[p_1,p_2,n] = \mathrm{PROB}(\,& D(t_{k-1}) = D(t_k) = 1,\ D[t_{k-2}+1:t_{k-1}-1] = 0,\ D[t_{k-1}+1:t_k-1] = 0 \mid \\
 & P(S(t_{k-2},t_{k-1})) = p_1,\ P(S(t_{k-1},t_k)) = p_2,\ S(t_{k-1},t_k) - S(t_{k-2},t_{k-1}) = n\,).
\end{aligned}
\tag{9}
$$

If the acoustic preprocessor is reliable, then $E[p_1,p_2,n]$ should
be small except for n=1 and should be negligible for n>2. In
the DRAGON system, it has arbitrarily been assumed that
$E[p_1,p_2,n] = 0$ for n>4. Note that $E[p_1,p_2,0]$ is undefined and
meaningless unless $p_1 = p_2$.

We can now estimate the conditional probability of the
sequence Y[1:T] given the sequence P[1:J].

$$
\mathrm{PROB}(Y[1:T] = y[1:T] \mid P[0:J] = p[0:J])
 = \sum_{n[1:K]} \prod_{k=1}^{K} B[p(z(k)),k]\, E[p(z(k-1)),p(z(k)),n(k)],
\tag{10}
$$

where $z(k) = \sum_{i=1}^{k} n(i)$ and the sum is taken over all sequences
n[1:K] such that z(K)=J. (By convention z(0)=0.)

In order to apply the theory of a probabilistic function
of a Markov process, it is necessary to specify the transition
probabilities for the phone sequence P[1:J]. It is the task of
the other sources of knowledge to specify these probabilities.
Phonological rules may be represented either directly or
indirectly in the estimates of $E[p_1,p_2,n]$ and $B[p,k]$, but all
higher levels of the hierarchy deal only with the sequence
P[1:J] and are isolated from the acoustics Y[1:T] or the labels
F[1:K].

Representation of Lexical Knowledge

This section discusses the computation of the
conditional probability PROB(P[1:J]=p[1:J] | W[1:I]=w[1:I])
where W[1:I] is the sequence of words in the utterance and
P[1:J] is the sequence of phones. Knowledge of the sequence
of words in an utterance is such a strong determiner of the
sequence of phones that it is unusual to formulate the
connection as a stochastic process. Nevertheless, the stochastic
formulation can represent the same rules as other formulations
and in a compact and computationally convenient form.

Let's first consider how alternate pronunciations of a
particular word can be represented by a probability network.
As an example, take the word "always" as used in the ARCS
(Automatic Recognition of Continuous Speech, IBM Rockwell)
system ([9],[5]). There are 432 pronunciations which are
allowed. The ARCS system can have such a complete list of
phonetic variants because it uses a network representation of
the alternatives and constraints. Some speech understanding
systems use an explicit list of alternate pronunciations, either
generated automatically from a phonemic dictionary or
preselected by hand. But an easy way to represent an
exhaustive list of alternate pronunciations is by a network.
The network representation for "always" is


[Figure: pronunciation network for "always"]


where the dots (.) are dummy nodes introduced so that
the network can be shown in two dimensions. We have
represented the phones as nodes rather than as arcs (which
would be even more compact) because such a representation
fits more easily into the integrated system. The node-based
representation permits explicit representation of sequential
constraints (such as the restriction that if /u/ is omitted, then
the following vowel cannot also be omitted).
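A node-based network of this kind can be sketched as a small
directed graph in which some nodes carry a phone and dummy nodes
carry none. The toy network below is hypothetical: it is not the
ARCS network for "always", and its phone labels and node numbering
are invented for illustration. It shows how a handful of nodes
spells out several pronunciations.

```python
def pronunciations(successors, labels, start, end):
    """Enumerate the phone string spelled by every start-to-end path."""
    results = []

    def walk(node, phones):
        label = labels[node]
        if label is not None:        # dummy nodes carry no phone
            phones = phones + [label]
        if node == end:
            results.append(" ".join(phones))
            return
        for nxt in successors[node]:
            walk(nxt, phones)

    walk(start, [])
    return results


# Hypothetical toy network: nodes 0 and 7 are dummy entry/exit nodes.
labels = {0: None, 1: "AO", 2: "L", 3: "W",
          4: "EY", 5: "IH", 6: "Z", 7: None}
successors = {0: [1], 1: [2, 3], 2: [3], 3: [4, 5],
              4: [6], 5: [6], 6: [7], 7: []}

variants = pronunciations(successors, labels, 0, 7)
```

Each optional phone or alternative vowel adds one or two nodes but
multiplies the number of paths, which is why a network can hold a
list of variants that would be long if written out explicitly.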

The network representing alternate pronunciations of a
given word can either be derived by hand and stored in a
dictionary of word networks, or can be derived by automatic
procedures. The automatic procedures take a canonical
pronunciation and apply phonological rules to produce a
network representing all likely pronunciations of the word.
Even if alternate pronunciations of words are not derived by
rule, the phonological rules are still important because many of
them can apply across word boundaries.

The process of applying phonological rules is one way
in which the DRAGON system deviates from the conceptual
hierarchy. The syntax and semantics of a particular task is
represented by a network in which each node corresponds to
a word. Using either a dictionary of canonical pronunciations
or a word network dictionary, a small network is substituted
for each word-node. The result is a network in which each
node is an individual phone. The phonological rules are then
repeatedly applied to the network. For each phonological rule
the entire network is searched to find any nodes which satisfy
the context conditions of the rule. Each rule provides an
alternate pronunciation of some sequence of phones. If the
alternate pronunciation is not already represented then an
extra branch is created in the network representing the
sequence of phones for the alternate pronunciation. This
process applies across word boundaries as well as within
words, depending on the phonological rule. Conditional
probabilities for the different branches of the phonetic
network are estimated from frequency of occurrence statistics
for a set of hand transcribed sentences. Such probabilities
could even be made dialect dependent or even talker
dependent. Note that the training sentences only need to be
phonetically transcribed; it is not necessary to know the time
at which each phone occurs since at this level we are no
longer dealing directly with acoustics.

The explicit representation of phonological rules in the
network is easily achieved at an expense of doubling or
tripling the number of nodes in the network. However, with
the stochastic network model it is not essential that an
exhaustive set of phonological rules be used. In fact
implementations of the DRAGON system have been made with
no explicit phonological rules and only one canonical
pronunciation for each word. The reason that this
representation is possible is that any phonological phenomena
which are not introduced explicitly will be treated at the
acoustic-phonetic level. Thus phonological substitutions can be
mimicked by adjusting the probabilities in the matrix B[p,k] to
include the probability that p is not the actual phone used by
the talker but rather that some other phone q is spoken.
Similarly, phonological insertions and deletions can be treated
by adjusting the probabilities in the matrix E[p1,p2,n]. The
disadvantage of this approach is that the matrices B and E
represent less context than is available in the explicit
representation of the phonological rules.
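One way to picture how a substitution is folded into B is as a
mixture: if R[p][q] is the probability that the talker actually
says q where the network specifies p, the adjusted entry becomes
the sum over q of R[p][q] * B[q,k]. The sketch below assumes that
mixture form; the matrix R, the function name, and the numbers are
illustrative assumptions and do not come from the DRAGON system.

```python
def absorb_substitutions(B, R):
    """Fold phone substitutions into the label probabilities.

    B[q][k]: probability of acoustic label k given that q is spoken.
    R[p][q]: assumed probability that q is spoken where the network
             specifies p (each row of R sums to one).
    Returns B_adj with B_adj[p][k] = sum over q of R[p][q] * B[q][k].
    """
    phones = list(B)
    n_labels = len(next(iter(B.values())))
    return {
        p: [sum(R[p].get(q, 0.0) * B[q][k] for q in phones)
            for k in range(n_labels)]
        for p in phones
    }


# Invented numbers: /t/ is realised as /d/ thirty percent of the time.
B = {"t": [0.9, 0.1], "d": [0.2, 0.8]}
R = {"t": {"t": 0.7, "d": 0.3}, "d": {"d": 1.0}}
B_adj = absorb_substitutions(B, R)
```

The adjusted row for /t/ is flatter than the original, which
reflects the loss of context noted above: the matrix cannot tell
in which environments the substitution actually occurs.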

There is a serendipitous benefit in using the matrices B
and E to represent acoustic-phonetic knowledge independently
from the representation of the phonological rules. If the
matrices B and E are estimated by running the acoustic
preprocessor on a collection of test utterances, then any
phonological rules which are left out in the prepared labeling
of the test utterances are automatically absorbed into the
estimates of B and E. Thus a perfect hand-labeled
transcription of the test utterances is not only unnecessary,
but undesirable. The best labeling for training purposes is an
automatically generated labeling from a procedure knowing the
sequence of words and having exactly the same lexical
knowledge and phonological rules as the speech understanding
system.


Representation of Syntax and Semantics


The syntax and semantics of a specific task domain can
be represented by a multi-level network corresponding to a
Markov process. Consider as a task a spoken chess move.
Chess has a specialized grammar as well as a specialized
vocabulary!^, [7]). Leaving aside a few special moves, a move
can be represented by a path through the following network:


The nodes in the above network are not in general
individual words, but are subgrammars which are themselves
represented by networks. For example:


Again, the nodes can be expanded as networks:


It is clear that any regular (finite state) grammar can
be represented by a finite network. But in a speech
understanding system the distinction between a regular
grammar and an arbitrary context-dependent grammar is
somewhat artificial. Consider the language of utterances
generated by a particular grammar, not the sequence of words
but the sequence of acoustic events. It is not unreasonable to
assume, for example, that each entry in B[p,k] is non-zero,
although perhaps very small. Such a result would
automatically be the case, for example, if the conditional
probability distributions for the acoustic parameters are
multivariate normal distributions.

But if each entry in B[p,k] is non-zero, then at the
acoustic level the language must include all possible sequences.
Such a language can, of course, be represented by a finite
network grammar. Thus the issue becomes not one of
generating the proper language, but rather one of modeling as
accurately as possible the conditional probabilities, which can
be context-dependent even for a context-free grammar.
Context is represented in the network by having separate
nodes for subgrammars which differ only with respect to
context. For example, in the chess grammar there are two
nodes marked "piece," one describing the piece which is
moving and one describing a piece which is captured. There is
clearly a trade-off between the size of the state space and the
amount of context which can be represented. For specialized
tasks it is not difficult to achieve a reasonable representation
of the grammar using most words at no more than two or
three nodes. The transition probabilities for the grammar
network can be estimated from statistics for a set of training
sentences. A large set of training sentences should be used,
but they only need to be transcribed orthographically, not
phonetically, at this level of the hierarchy. If Bayesian
statistics are used, the a priori probabilities could be set to
achieve the same effect as a non-probabilistic use of the
grammar. The a posteriori probabilities would then be a strict
improvement (as judged by the training sentences).
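The estimation step described above can be sketched as counting
word-to-word transitions in orthographic transcriptions and
combining them with a prior. In the sketch below the prior is a
uniform pseudo-count on every arc the grammar allows, so that with
no training data each allowed arc is equally likely and disallowed
arcs stay at zero. That choice of prior, the sentence markers, and
the toy grammar are assumptions made for illustration.

```python
from collections import Counter


def estimate_transitions(sentences, allowed, prior_weight=1.0):
    """Estimate grammar-network transition probabilities.

    sentences: list of word lists (orthographic transcriptions).
    allowed:   dict mapping each node to the successors the grammar
               permits; arcs not listed keep probability zero.
    prior_weight: pseudo-count given to every allowed arc.
    """
    counts = Counter()
    for words in sentences:
        path = ["<s>"] + list(words) + ["</s>"]
        counts.update(zip(path, path[1:]))

    probs = {}
    for node, successors in allowed.items():
        total = sum(counts[(node, s)] + prior_weight for s in successors)
        probs[node] = {
            s: (counts[(node, s)] + prior_weight) / total
            for s in successors
        }
    return probs


# Hypothetical fragment of a chess-move grammar.
allowed = {"<s>": ["pawn", "knight"], "pawn": ["to", "takes"]}
training = [["pawn", "to"], ["pawn", "to"], ["pawn", "takes"]]

untrained = estimate_transitions([], allowed)
trained = estimate_transitions(training, allowed)
```

With an empty training set the result treats every allowed arc
alike, which corresponds to the non-probabilistic use of the
grammar; adding sentences moves the estimates toward the observed
frequencies.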

To the extent to which the statistics of the training
sentences reflect the true probabilities for spontaneous
utterances for the specific task, the probability network
represents not only the syntax of the task but also all of the
recognition information which can be obtained from the
semantics of the available context. That is, assuming the
probabilities are correct, the probability network is an optimal
predictor for a given amount of context, and therefore predicts
at least as well as a human who is given the same amount of
context and who presumably understands the sentence
(although the context in this case is not the whole sentence).

Inter-sentence semantics can also be introduced into
the probability network. One way to use inter-sentence
semantics is to employ a user model. Suppose there is a model
for the user in a particular task which gives probabilities for
the user transitioning among a finite number of states
depending on the types of utterances which the user has made
in the past. Conceptually this model fits in easily as an extra
level in the Markov hierarchy. Computationally it requires that
conditional probabilities be estimated separately for each user
state. However, since the user transitions between states only
between utterances, a given utterance is analyzed using only a
single representation of the probability network. The
probabilities in this single network are weighted averages of
the probabilities for the various user states. A user model is
especially valuable if certain key sentences trigger user state
transitions with probability one and if for each user state a
small subset of the general grammar is used. Then there is a
savings in both computation and storage requirements.
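The weighted average over user states can be sketched as follows.
The state names, the nested-dictionary layout of the per-state
networks, and the numbers are hypothetical; only the averaging
rule itself is taken from the text.

```python
def mix_networks(state_probs, networks):
    """Combine per-state transition probabilities into one network.

    state_probs[state]        : current probability of that user state.
    networks[state][node][s]  : transition probability node -> s
                                estimated for that user state.
    Returns mixed[node][s] = sum over states of weight * probability.
    """
    mixed = {}
    for state, weight in state_probs.items():
        for node, arcs in networks[state].items():
            row = mixed.setdefault(node, {})
            for succ, p in arcs.items():
                row[succ] = row.get(succ, 0.0) + weight * p
    return mixed


# Invented example with two user states.
state_probs = {"browsing": 0.25, "querying": 0.75}
networks = {
    "browsing": {"<s>": {"next": 0.8, "find": 0.2}},
    "querying": {"<s>": {"next": 0.1, "find": 0.9}},
}
mixed = mix_networks(state_probs, networks)
```

When a key sentence drives one state probability to one, the
mixture collapses to that state's network, and arcs that the state
never uses drop out of the computation.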

PERFORMANCE  RESULTS 

The  testing  of  the  system  is  still  at  too  preliminary  a 
stage  to  make  any  definitive  conclusions,  but  initial  results  are 
very  promising.  Simulation  studies  have  shown  that  the 
system  can  perform  well  despite  a high  error  rate  in  the 
acoustic  preprocessor.  In  its  first  test  with  live  speech  input, 
the  system  correctly  recognized  every  word  in  all  nine 
sentences in the test

ABSTRACT

Simple schemes are presented for segmenting and
labeling continuous speech which are independent of the acoustic
parameters used as input. Central to this approach is the belief
that simple, parameter-independent structure is desirable at this
level of speech recognition: 1) for comparisons among the various
parametric representations for speech, 2) to provide a benchmark
for any other scheme purporting to be better in either
segmentation or labeling, 3) to avoid encoding in the algorithms the
limitations of a representation, 4) to allow for more automatic
training and adjustment, and 5) to study schemes that permit
efficient hardware realization.

The segmenter is based upon the idea that significant
change in a parameter should be sufficient evidence for a boundary,
and that this evidence can be collected and viewed as a sum of
weighted votes. A two-stage threshold network collects the vote
sum and locates boundaries at local maxima in the sum, thus
allowing context to have an effect.

The labeler takes a well accepted viewpoint from pattern
classification research -- that distance in the space of acoustic
parameters is strongly related to similarity in acoustic nature.

Three sets of acoustic parameters are used as input to
the two procedures: amplitude and zero-crossing counts from
octave band-pass filters (ZCC), smoothed LPC derived spectrum
envelopes (SPG), and the frequencies and amplitudes of the first
five peaks in the SPG (FMT). A straightforward training process is
undergone for each parametric representation. Results are
presented for a set of utterances spoken by the same speaker as
the training corpus. The results obtained compare with human
performance in segmenting and labeling with no syntactic or
semantic support.

INTRODUCTION

Attempts at computer recognition of continuous speech
have clearly pointed out the need for methods for dividing the
speech signal into discrete acoustic segments and for labeling those
segments in as accurate and robust a manner as possible. A
number of specific methods for segmenting based upon particular
acoustic parameters have been proposed (see for example, Fant6i,
Reddy66, I>nes68, Broad72). We believe that simple uniform kinds
of algorithms may be applied to the problem of segmentation and
labeling of continuous speech in a manner independent of the choice
of parametric representation of the speech signal. Although they
may, doubtless, be improved by application of specific knowledge
about the response of the parameters to particular speech
phenomena, this knowledge has not yet been codified, or even
acquired in sufficient breadth to support comparisons among the
representations. The possible variations upon the methods for
extracting parameters from the acoustic signal are endless, so it is
imperative that a reasonably effective way of employing any such
representation be found.


We will propose two such algorithms as benchmarks. We
do not expect them to perform as well as more heuristic methods
with significant amounts of speech knowledge, but they will provide
as good an input to the higher levels of speech recognition as is
found in many earlier systems and may be used as an off-the-shelf
package. In addition, any method that proposes to advance the state
of the art should do significantly better than these schemes.

There can be strong interaction between the segmenter
and labeler. Information about segment identities may be used to
verify or correct boundaries; on the other hand, the association of
the input within a segment as all contributing to a single source
provides extra information to the labeling process. In the overall
recognition system, these two processes combine to form a source
of knowledge that transforms the acoustic signal into a sequence of
discrete segmental phonetic identifiers. Later processing by higher
levels may transform that sequence, correct it by applying rules of
phonetic context, or even go back to the acoustic input in conditions
that warrant more careful but expensive analysis. Primarily, we
must deal with this level as a data reduction and transformation
process.

Form of the Problem and Previous Methods

Most methods for analyzing the acoustic signal result in a
vector of parameters at regular intervals in time.(1) The elements
of this vector may be considered as measurements of features or as
parts of an overall descriptor of the acoustic state of the signal. A
great deal of effort has gone into the search for a set of such
parametric measurements that display useful properties -- complete
information (as verified by human perception experiments),
orthogonality (or independence -- for better data compaction),
independence of variations in speaker and equipment
characteristics, etc. It was hoped that such parametric
representations would lend themselves to the least errorful possible
labeling of the phonetic content of the signal.

We are concerned here with the actual use of these
parametric representations of the signal. We have a number of
goals other than that of improving accuracy. They stem directly
from deficiencies we feel are present in the current approaches at
this level. Previous approaches have been ad hoc in their
development. Typically, a representation is studied for its acoustic
properties and the information obtained is codified in specialized
rules. Even application of standard pattern classification methods is
adapted to the particular pattern space by heuristic selection of
weights and of classes based upon the strengths and limitations of
the representation. Thus, there is no clear distinction between the
efficacy of an algorithm for labeling or segmenting and that of the
particular acoustic parameters.

1) There are rarely any comparative studies available
because of the dependence of each system upon a priori
assumptions. We would like to be able to perform comparisons
among the parametric representations, the results of which we are
confident can be extended to more heuristically coded production
versions of the segmentation and labeling processes.


(1) This research was supported in part by the Advanced Research
Projects Agency of the Department of Defense under contract no.
F44620-73-C-0074 and monitored by the Air Force Office of
Scientific Research.


(1) We also adopt the convention of vectors at uniform intervals.
However, other methods (Baker74) show promise precisely because
they do not average measurements over short time intervals, but
rather measure specific events in time and the intervals between
them.

2) We would like to present a benchmark to the
community, with enough performance capability to support a
reasonable recognition system, but which must be surpassed if the
other goals discussed here are sacrificed.

3) Many sets of parameters are correlated in well
understood ways with one another, such as amplitude measurements
in filter bands. In dealing with filter arrays, for example, one often
implicitly encodes the concept of closeness in frequency with
closeness in the array. We do not want to encode the structure,
and the limitations, of a particular parametric representation into
the algorithms unless we are satisfied that the advantages of doing
so outweigh the loss of generality and flexibility. While there is
nothing wrong and much to be gained in using this information --
the best systems will have to -- we would like to have some
confidence in our choice of parametric representation before we do
so.

4) As well as comparative rating of parametric
representations and benchmarks, there is also a need for methods
that are straightforward in structure and implementation. Available
schemes for unsupervised learning and for tracking of slowly or
infrequently shifting clusters in the pattern space depend for their
success upon an uncomplicated model of pattern classes and
uniform treatment of the dimensions of the pattern space.

5) Such algorithms are more easily realized in hardware,
with the consequent speedup so available. Since the algorithms are
designed to be independent of the particular acoustic parameters,
fixing them in hardware will not be as big a risk as one might think.

Other problems arise in dealing with variations introduced
by different speaker and equipment characteristics, or different
vocabularies and hence phonetic contexts. Their effects, while
significant to the operation of a complete speech recognition
system, are secondary in this context. We expect that the results
obtained over uniform, high-quality data, with the simple algorithms
we are proposing here, will degrade gracefully with the introduction
of other sources of variation and noise into the data.

SEGMENTATION 

The first process we would like to apply to the input data
is to segment it in time into related acoustic segments. This is often
done at a later stage, after some labeling, or at least recognition of
features such as "voiced", "fricated", or "silence" has been
attempted at small regular intervals. However, our approach is to
attempt the segmentation initially, in order to have that
segmentation as useful input to the labeling process. If we err in
favor of too many boundaries, we may always combine segments
with similar labels, once those labels are placed.

Evidence for Segment Boundary

Clearly, the concept of acoustic similarity and difference
is central to any segmentation procedure.(1) Thus, one might,
instead of labeling each interval, label the interstices between the
intervals, i.e. measure the difference, according to some classifying
rule, between adjacent intervals. The distance, in some parameter
space for example, between the pattern associated with a noise-like
interval (fricative) and that of a nasal-like interval would be great
enough to signal the placement of a boundary, while the distance
between patterns for high and middle vowel-like sounds might not
and probably should not. There is definitely an element of risk in
adopting such a decision strategy -- that important boundaries will
be missed because:

1) the distance measurement is not sensitive to change
in certain directions in the acoustic space,

2) the parameters do not reflect such changes,

3) the change is too slow,

or 4) the magnitudes of the changes vary considerably
with context,

and thus not be susceptible to easy decision rules. Problems 1 and
2 will bother any segmentation procedure, and must be solved by
choosing better parametric representations for speech. The problem
of slow change, 3, will also plague many different algorithms. It is a
peculiarity of speech that must be dealt with. Problem 4, varying
magnitude of change, can be approached fairly simply by treating
the change as a signal in time. We will show one possible approach.


(1) If each interval is labeled as some acoustic type, the grouping
together of strings of these labels, as is often done, according to
higher or broader type classification is just an assertion that
boundaries should occur where adjacent intervals belong to very
different acoustic types.

The rule we have chosen is based upon the idea that
each parameter of the speech signal can be viewed as a separate
channel of time varying information about the utterance (Figure 1).
A sudden or significant change in even a few channels should signal
a boundary. Thus we may collect evidence about the placement of
boundaries by placing a threshold on the change in each channel,
and report when the threshold is exceeded over adjacent time
interval differences. Because we expect some changes to occur
gradually, we measure the change between intervals one unit
further away (a total of three units) as well and allow them to react
to another set of thresholds.

Figure 1 -- Segmenter Voting

A second stage is needed to combine these votes for
change of the individual threshold units. If this were a Perceptron
recognizer, this second stage would not have memory and could not
take context in time into account except as it was explicitly
measured by a weighted combination of the primary stage units.
However, this will not work when changes vary in magnitude and
the number of channels affected. The acoustic context greatly
affects the suddenness and severity of a boundary. There is no
threshold that can be used on the vote sum; for example, an overall
vote level that specified change from fricative to vowel would be
too large to work for silence-nasal transitions where only a few
parameters may change. The situation we wish to recognize is that,
whatever change does occur, it is greatest at the point we wish to
mark. Thus a local maximum is found in the sum of the threshold unit
votes. These votes are weighted to emphasize the adjacent interval
differences over the longer slower changes. However, the actual
differences are not summed. Rather, a vote for change is considered
to be of equal importance from any channel if it triggers over that
channel's threshold.
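The two stages can be sketched as follows. This is an illustrative
reading of the description, not the authors' program. It assumes
that the three-unit difference for the interstice between frames t
and t+1 compares frames t-1 and t+2, that one threshold per span is
shared by all channels, and that a peak needs a minimum vote sum
(the significance parameter) to count. The default weights of two
and one are the values quoted in the Training section; the function
names and the test data are invented.

```python
def vote_sum(frames, thresh1, thresh3, w1=2, w3=1):
    """Primary stage: one vote total per interstice between frames.

    frames: list of equal-length parameter vectors, one per interval.
    A channel votes w1 when its adjacent-frame change exceeds thresh1
    and w3 when its change across three units exceeds thresh3.
    The sizes of the differences themselves are never summed.
    """
    T = len(frames)
    votes = [0] * (T - 1)
    for t in range(T - 1):
        for c in range(len(frames[0])):
            if abs(frames[t + 1][c] - frames[t][c]) > thresh1:
                votes[t] += w1
            lo, hi = t - 1, t + 2
            if lo >= 0 and hi < T:
                if abs(frames[hi][c] - frames[lo][c]) > thresh3:
                    votes[t] += w3
    return votes


def boundaries(votes, significance=1):
    """Second stage: mark local maxima of the vote sum."""
    marks = []
    for t, v in enumerate(votes):
        left = votes[t - 1] if t > 0 else -1
        right = votes[t + 1] if t + 1 < len(votes) else -1
        # strict on the left so a flat peak is marked only once
        if v >= significance and v > left and v >= right:
            marks.append(t)
    return marks


# Two channels that step from 0 to 10 between frames 3 and 4.
frames = [[0, 0], [0, 0], [0, 0], [0, 0],
          [10, 10], [10, 10], [10, 10], [10, 10]]
votes = vote_sum(frames, thresh1=5, thresh3=5)
marks = boundaries(votes)
```

Because the decision rests on where the vote sum peaks rather than
on its absolute level, a transition that moves only a few channels
is still marked as long as nothing nearby changes more.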


Transition  Segments 


The local search for maxima can also incorporate a
measurement of slope or area to try to characterize the gradualness
of change. Broader peaks in the vote sum will indicate transitionary
portions of the signal which are changing acoustically over a longer
time than usual. We have had some success in distinguishing such
segments by measuring the width of the vote sum at a given drop
below each peak. If the width exceeds some preset limit, the peak is
considered to represent, not a boundary, but a transitionary
segment and is marked accordingly.(1) Difficulties occur because:
i) Some such segments are transitionary only in one channel or only
slightly as compared to the entire signal. Hence the vote sum itself
is very low and the width measurement is confused by noise effects
from other channels. ii) Noise in the signal or parameter
measurements may give the effect of transitionary segments by
masking a sharp change. While the method described here has not
yielded what we would consider good identification of transitionary
segments, it has improved the location of boundaries. This suggests
that we may not have a clear idea of what kind of phenomenon we
mean by "transition." (See Figure 2 for some examples.)

Figure 2 -- Vote Sum Peak and Transition Location
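The width test can be sketched as below. The drop and the width
limit are free parameters whose values here are arbitrary, and the
function names are hypothetical; the sketch only illustrates the
idea of measuring how wide a peak is at a fixed drop below its top.

```python
def peak_width(votes, peak, drop):
    """Count contiguous interstices around `peak` whose vote sum
    stays within `drop` of the peak value."""
    level = votes[peak] - drop
    left = peak
    while left > 0 and votes[left - 1] >= level:
        left -= 1
    right = peak
    while right + 1 < len(votes) and votes[right + 1] >= level:
        right += 1
    return right - left + 1


def classify_peak(votes, peak, drop=2, max_width=3):
    """Call a broad peak a transition segment, a narrow one a boundary."""
    if peak_width(votes, peak, drop) > max_width:
        return "transition"
    return "boundary"


sharp = [0, 0, 2, 6, 2, 0, 0]     # abrupt change: narrow peak
gradual = [0, 3, 4, 5, 4, 3, 0]   # slow change: broad, low peak
```

The second difficulty listed above shows up directly in such a
measure: noise votes from unrelated channels raise the shoulders of
a sharp peak and can push its width over the limit.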


Training


Finally, we must specify the threshold values to be used
for the primary stage voting. These will, of necessity, depend upon
the parametric representation chosen, but will depend in a uniform
manner upon it. By uniform we imply that a standard procedure for
training will be sufficient and may be applied without a great deal
of knowledge about the parameter space. We have obtained good
results by setting all the primary stage thresholds to the same
value. The results then depend upon that one value and a
significance threshold used for ignoring small peaks in the vote sum.
The other parameters of the process are the weights of the
threshold unit votes. We have argued that they ought to be the same
over all channels, and have fixed them at two and one for the one
and three interval differences respectively. However, the relative
importance of the channels may be learned from the training data
fairly easily. These few parameters form a small set of values
through which one may search with a corpus of hand marked data.

A more direct method for learning the thresholds is to
collect the values of the differences at hand marked boundaries on
a training set of utterances. At each boundary, the largest
difference for each time span (1 or 3 for example) is considered
relevant and is used to force the threshold down to its value. A
preliminary look at histograms of these differences (Figure 3) will
show a level below which one should ignore the boundary as
spurious. (Often, hand labeled boundaries do not occur at points of
any acoustic change, but represent the segmenter's idea of a
phonemic boundary.) The resultant thresholds should be able to
recognize at least all the non-spurious hand marked boundaries, and
probably will mark more. This tendency towards too many boundaries
is, we feel, the least of many evils. The frequencies with which a
channel shows the greatest change will give a good idea of relative
importance for voting weights, if they are desired.

Figure 3 -- Histogram of Differences at Hand Marked Boundaries
2nd Formant Amplitude Parameter

LABELING

A great many algorithms have been proposed for labeling
a piece of acoustic data with its phonetic type. Although this
problem seems to fit directly into the basic pattern classification
model, and although pattern classification research has developed
methods for a variety of situations, the general consensus has been
that these methods are not sufficiently powerful to solve the
speech recognition problem. We feel that this is a negative reaction
to initial failures. Even though the identification of phonemes by
uniform classification rules will probably not be accomplished --
there is no reasonable representation of the acoustic signal that
contains all the needed information about context, prosodics, and
coarticulation to allow classification at the phonemic level -- the
methods developed for classifying vector patterns can be successful
if greater effort is spent in choosing the classifier, acquiring the
relevant statistics, and choosing the proper classes for speech.(1)


(1) Other attempts have treated every boundary as a (possibly)
short transition segment (Reddy66).


Distance  in  Pattern  Space 

One important way of viewing a pattern to be recognized
is that it is represented by some point in a space of possible
patterns, and central to that conceptualization is the notion that the
distance between two points in the pattern space relates to
similarity of the patterns represented by them. We have compared
a number of distance measures that are well known to pattern
classification research and have chosen a few simple distances that
essentially provide linear classifying boundaries in the pattern
space. They are correlation (the n-space angle), Euclidean distance
(the magnitude of the difference vector), and Euclidean distance in a
variance normalized space.(1) (See Figure 4.)


(1) This last issue involves a greater understanding of the statistical
nature of the pattern space than is available. How do the clusters
relate to one another, what are the significant subclasses of a
phone, and how will the label be used in the rest of the recognition
system?


                                (Soft fricative or burst)
DH  ZH                          (Voiced fricative)
HH  WH  SH                      (Fricative)
YTOT  WTOT  RTOT  LTOT          (Unvoiced glide)
Y  W  R  L  EL                  (Glide)
M  N  NX  EM  EN                (Nasal)
IY  IH  UW  UH                  (Vowel -- neutral)
EH  ER  AX  OW  AO  AE  AA      (Vowel -- velarized, nasalized, retroflexed)


Table 1 -- Phonetic Classes, Initial Definitions
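
The three distance measures named above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the code used in the system described: the function names, the plain-list pattern vectors, and the dictionary that maps each label to a (mean, standard deviation) pair are all hypothetical. The last function ranks every cluster by distance, in the spirit of the nearest-cluster rule this section goes on to describe.

```python
import math


def correlation_angle(x, mean):
    """Angle in radians between pattern x and a cluster mean (the n-space angle).

    Assumes neither vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(x, mean))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in mean))
    # Clamp so rounding error cannot push the cosine outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def euclidean(x, mean):
    """Magnitude of the difference vector."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, mean)))


def normalized_euclidean(x, mean, std):
    """Euclidean distance after scaling each element by the class standard deviation."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, mean, std)))


def rank_labels(x, clusters):
    """Order cluster labels from nearest to farthest.

    clusters maps a label to (mean, std); this layout is hypothetical.
    """
    return sorted(clusters, key=lambda label: normalized_euclidean(x, *clusters[label]))
```

Ranking all clusters, rather than returning only the nearest one, keeps the runner-up labels available to later stages.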


Figure 4 -- Decision Boundaries

More complex partitioning of the pattern space can take
many forms (Nagy63, Meisel72). Often some estimation of the
density function of the patterns within each class is made, then a
Bayes optimal rule is defined by choosing the class with the
greatest a posteriori probability. However, the computational
requirements of such a calculation, the difficulty of estimating the
densities in question, and the fact that they will have to be easily
altered as conditions and speakers vary suggest that simpler
methods be used.

The algorithm is simply to compute the "distance"
between the unknown pattern and each of the clusters in turn. The
clusters are defined by whatever statistics they may require, such
as the mean and standard deviation of each element over the
training samples. Although this requires more computation than a
successive splitting of the space into subsets of classes, it is more
flexible and does not require a hierarchy of classes.

When the classes to be recognized are composed of
multiple sub-classes that are themselves more well defined (more
tightly clustered in the pattern space), a good approach is to form
partitions that separate the sub-classes and then combine them by
rule. This is sometimes called drawing a piece-wise boundary, and is
closely related to the nearest-neighbor and Parzen window
methods.(2) There are well accepted hierarchical divisions of the
speech sounds that may be used to provide such a sub-division. We
have chosen to define simple clusters that correspond to a set of
77 most significant allophones of English(3) (Table 1). These become
labels that may take on different acoustic and phonetic meaning as
training progresses -- they lose their phonemic meaning except that
we begin by collecting statistical information about these allophones
from a hand labeled corpus of data.

The hand labeling process, of necessity, involves some interference
by the concept of phoneme, although we have attempted to label
sub-phonemically -- to label the separate sounds within a phoneme.

Once the initial statistics are gathered, training proceeds by
applying the machine segmenter and labeler to the same corpus.
This provides a second set of labeled data which may be used to
compute new statistics. In the case of Euclidean distance from the
cluster means, the process described can be shown to converge to
a minimum intra-cluster scatter. In any case, after a few iterations
the clusters have changed their character and can no longer be
considered as allophones. They do, however, provide consistent
labeling of a wide variety of acoustic phenomena, and the phonetic
correlates of those labels can be seen in an inspection of the
training corpus and what class labels occur in various phonetic
contexts.

Multiple Labels - the Entire Segment and Classes for Labeling


To enable the labeler to use information from the
segmentation, we keep an ordered list of the best few labels
(usually 5) for each time interval (each pattern vector) in a
segment's center half. These contribute to selection of the segment
label by voting with a weight determined by their position in the
ordered list. We have had reasonable results from the weights
5, 4, ..., but there is clearly room here for application of better
information. An estimate of the a posteriori probability of the label
could be made from the values returned by the distance measure,
for example. This voting scheme additionally provides an ordered
list of labels for the segment. We have chosen to output the entire
list for use in higher level analysis, since often the top two or three
labels are close in their scores. The rules for extracting phonetic
features from these sets of labels are being developed as a source
of knowledge for the Hearsay II system at Carnegie-Mellon
University (Lesser74). The use of these labels becomes an
interesting problem. They clearly have acoustic meaning, since that
is defined by the cluster statistics and the classifier rule -- i.e., by a
piece of the pattern space. However, they have phonetic meaning as
well, because they are interpreted in the light of the phones (from
a hand labeled corpus) within which they occur. This acoustic-phonetic
correlation can be quantified and modeled by the
frequencies of the abovementioned occurrences. If the frequencies
are treated as probabilities that a label will be realized within a
segment corresponding to a particular phone, Bayes rule can be
used to estimate the a posteriori probability that the phone was
there.(1)
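
The voting scheme and the Bayes estimate described in this paragraph might be sketched in Python as follows. This is an illustrative reading only. The function names and data layouts are hypothetical, the weights 5, 4, 3, 2, 1 extend the "5, 4, ..." given in the text, and taking the "center half" to mean dropping the first and last quarter of the intervals is an assumption.

```python
from collections import defaultdict


def vote_segment_labels(frame_rankings, weights=(5, 4, 3, 2, 1)):
    """Combine per-interval label rankings into one ordered list for a segment.

    frame_rankings holds, for each time interval of the segment, its labels
    ordered best first. Only the center half of the segment votes.
    """
    n = len(frame_rankings)
    quarter = n // 4  # assumed reading of "center half": drop the outer quarters
    votes = defaultdict(int)
    for ranking in frame_rankings[quarter:n - quarter]:
        for weight, label in zip(weights, ranking):
            votes[label] += weight
    # Highest total first; ties are broken alphabetically to keep output stable.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


def phone_posterior(label, label_given_phone, prior):
    """Bayes rule: P(phone | label) from P(label | phone) and P(phone)."""
    joint = {
        phone: label_given_phone[phone].get(label, 0.0) * prior[phone]
        for phone in prior
    }
    total = sum(joint.values())
    if total == 0.0:
        return joint  # the label was never observed with any phone
    return {phone: p / total for phone, p in joint.items()}
```

The vote function returns the whole ordered list with its scores, so that close runner-up labels remain visible to higher level analysis.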


(1)  The  elements  of  the  pattern  vector  are  normalized  by  a weight 
proportional  to  the  standard  deviation  of  elements  of  the  patterns 
in  the  particular  class  in  question. 

(2) One level of the Stanford signature tables is devoted to building
up a piecewise description of the pattern cluster associated with a
phone class.

(3) Shockey, L., Private communication, January 1974.


(1) This has not yet produced good results -- possibly because the
models used to model the phone to label probabilities are not
very good. An important element in the Bayes calculation is the a
priori probability of each phone. This might be supplied by the
higher levels from analysis of sequences, hypothesized words,
probable length of segments, etc.

RESULTS and CONCLUSIONS

The results obtained with the uniform algorithms we have
presented should be considered in the light of their usefulness to a
larger system. We recognize that these methods are weak in
comparison to what humans can do and to what we will need for
successful recognition of continuous speech with relatively
unconstrained syntax and semantics. However, Shockey and Reddy
(Shockey74) measured accuracy of phoneme identification by
humans working from spectrograms, from waveforms, and
acoustically in foreign language utterances where no higher level
support was available. The results they obtained may put bounds on
our reasonable expectations of machine recognizers. As one would
expect, acoustic input provided a much better identification rate
than the graphical representations, which were about equal. Yet the
actual rates were approximately 30% (waveform or spectrogram)
and 70% (acoustic) for a set of about 50 phonemes. This would
indicate that a considerable reliance upon higher level processing is
necessary. Identification into about six gross types occurred with
rates of 80% and 95%. We suggest that a machine recognizer at the
local classification level of a system would be doing well to provide
recognition in the 30/80 range until more is made available about
the particular mechanisms that enable humans to process acoustic
information.

Table 2 summarizes the results of the algorithms on three
different parametric representations for a corpus of five sentences:
    What is the average uranium lead ratio for the lunar samples?
    Do any samples contain troilite?
    Who is the owner of utterance eight?
    Where were you when we were all away?
    We all heard a yellow lion roar.
The three representations were:

ZCC -- 12 parameters, Amplitude and Zero-crossing count
       from each of 5 octave filter bands and unfiltered
SPG -- 128 smoothed spectral envelope points from LPC
       coefficients (Markel68)
FMT -- Formant frequencies and amplitudes from the SPG
       envelope, 10 parameters

The labeling distance measure used was Euclidean distance
weighted by the variance. The segmenting thresholds were all
obtained by the training method discussed earlier. The utterances
were recorded under the same conditions and by the same male
speaker as the corpus of utterances used to gather statistics for
the labeler, to train the segmenter thresholds, and to refine the
cluster set. However, we have observed only a mild reduction in
accuracy when data recorded by other male speakers is analyzed.

                                 ZCC    SPG    FMT

Labeling (percent correct)
    Exact label                   14     32      8
    Rough label                   69     79     47

Segmenting (number, out of 134 hand marked segments)
    Missing                       13      3      6
    Extra                         59    138    112

Table 2 -- Results of Machine Segmenting and Labeling

Remarks: 1) The counts of missing and extra segment boundaries
are highly negatively correlated; thus the high number of extra
segments which SPG and FMT display explains their low missing
segment score. This was due primarily to poor training of the
thresholds. However, the extra segments were usually labeled
properly and could easily be recombined.

2) The rough label score is the percentage identified into the
correct class of about 10 broad classes of speech sounds. This was
done to compare with the foreign language experiment referred to
above.


Errors  in  Segmentation 

We can separate segmentation errors into three types:
errors of extra segments, missing segments, and transition
identification. The probable effect of an error upon a speech
understanding system and, specifically, the labeling process, will
vary considerably with the type of error.

Extra segments -- We have biased the threshold training
towards thresholds that will produce too many segment boundaries.
These errors are not very serious since, if a sequence of short
segments is labeled with similar labels that indicate a
sustained-type phonetic situation, then the segments may be combined and
the labels collected by a voting scheme similar to the one used to
combine individual intervals. The most common occurrence of this
phenomenon is during silence segments. The other common situation
is during gradually changing sustained segments, usually trailing off
into silence at the end of a phrase. These may also be detected by
the characteristic short segments with related labels.
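
One possible way to recombine such runs of short segments is sketched below in Python. It is an assumption-laden illustration, not the procedure used in the system: the function name, the (start, end, label) tuples, the `broad_class` lookup that decides when two labels count as "similar", the `max_len` limit, and the duration-weighted vote for the merged label are all hypothetical choices.

```python
def merge_short_segments(segments, broad_class, max_len=3):
    """Merge each short segment into its predecessor when both fall in the
    same broad class.

    segments: time-ordered (start, end, label) tuples, end exclusive,
              measured in analysis intervals.
    broad_class: maps each label to a broad class name (hypothetical table).
    The merged segment takes the label with the greatest total duration.
    """
    merged = []  # each entry: [start, end, {label: total_length}]
    for start, end, label in segments:
        length = end - start
        if merged:
            prev = merged[-1]
            prev_label = max(prev[2], key=prev[2].get)
            if (length <= max_len and prev[1] == start
                    and broad_class[prev_label] == broad_class[label]):
                prev[1] = end
                prev[2][label] = prev[2].get(label, 0) + length
                continue
        merged.append([start, end, {label: length}])
    return [(s, e, max(votes, key=votes.get)) for s, e, votes in merged]
```

With this rule a long vowel followed by three short silence segments comes back as two segments, the vowel and a single silence.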

Missing segments -- This is a more serious type of error
since it requires, for correction, that the rest of the system
hypothesize the existence of the missing segment. In addition, it
causes the labeler to combine information from two segments that
are acoustically similar, but do differ somewhat. Very often, the
errors that seem to be of this type are actually indications of a
case where a phoneme boundary "exists" but no phonetic change
occurs. Manual segmentations often contain such boundaries, and
we must rely upon the higher levels of analysis to postulate such
non-acoustic divisions of the utterance. Most of the significant
problems seem to occur at glide-vowel junctures. This appears to
us to be the kind of problem that can be dealt with after some
initial labeling has occurred. If we have located a sonorant segment
with glide and vowel characteristics, we may invoke a formant
tracker, or a specialized segmenter that understands the parameter
space as it relates to the classes in question. It may make
considerable demands upon system resources, because it is only
used when needed.

Transition identification -- It is reasonable to treat every
change from one sustained segment to another as a transition
segment. We have attempted to identify only those transitions which
occur for a significant length of time. Since this is a subjective
quality, there can be no absolute measurement of correctness. What
we have observed is that the transition finding process seems to
help in some cases where boundaries should be located at the
beginning or end of change rather than at the point of greatest
change, and it does not hurt in most other cases. Clearly, more
accurate transition identification could be done using the labeler
output at a higher level in the system.

Errors  in  Labeling 

The errors encountered in our attempt to do phonetic
labeling will be less critical if there is information available to the
speech understanding system to correct those errors when other
constraints indicate that the initial label choice is wrong. We
presented a simple way of providing this information by returning
the top few labels as they were rated over the center half of the
segment. The approach in Hearsay II (Lesser74) will be to extract
features from this list of labels; however, other uses could be made
as well. One should consider a labeling algorithm good if the correct
phonetic label occurs in the top few, and especially if it is strongly
reinforced by phonetically similar labels.

Some labeling errors occur because the segmenter has
failed to separate two different segments. Usually some
characteristics of each can be seen in the labels, but the confusion
can be serious. Most errors, however, are direct results of the
inadequacy of the parameters to represent the acoustic "difference"
as a simple distance. Goldberg performed preliminary rating of some
parametric representations and simple distance measures
(Goldberg73). The results are not unexpected -- spectral
envelopes did fairly well, for example, as did a generalized
quadratic classifier based on assumptions of normality. The
interesting point is that the best results fell into the range of
human performance shown by Shockey and Reddy.

Conclusions 

We have shown that the same uniform algorithms may be
used to produce segmentation and labeling from quite different
parametric representations of the speech signals. The ability to
make comparisons is thus made available. The algorithms are simple
in form, and thus easily implemented in hardware. Their
performance, while not at the state of the art, is not far behind it.
We would recommend a "front end" of such methods for a
straightforward speech system. Such systems will be desired to test
new ideas for higher levels, to provide man-machine communication
in highly constrained tasks, and to test basic changes in system
structure.

Our plans include the application of these algorithms to
other parametric representations than we have presented here. A
comparative evaluation is being made of a variety of parametric
representations for their ability to support segmentation and
labeling.
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Abstract 

Time domain analysis has proven quite useful for
revealing meaningful acoustic transients in human speech.
Although many of these transients are both quite brief in
duration and low in amplitude, they occur consistently in
connected speech. This paper outlines the kinds of analyses
performed and their results pertaining to the fricatives and
stop consonants.

This paper describes the results of applying our new
time-domain techniques to the analysis of complex waveforms,
in this case human speech. Their chief advantage is precise
temporal resolution allowing exact timing of articulatory events
within a sample of speech; that is, no bandwidth limitation is
present. This temporal resolution is most significant for
characterising fast transitional regions such as occur at
vowel-consonant and consonant-vowel boundaries and within stop
consonants. In addition, certain characteristics of these
regions are either greatly enhanced or uniquely apparent in
the time-domain. Such information is revealed in our visual
displays generated from the speech waveform up-crossings in
time. We call these log inverse period (LIP) plots.

The impetus for this work comes from two sources:
1) First are the studies by Licklider and his colleagues (5,6)
who 25 years ago demonstrated the intelligibility of infinitely
clipped speech. This showed that sufficient acoustic speech
information is encoded in the zero-crossings of the waveform
itself. Given the redundancy of speech such information is
most probably encoded by other aspects of the waveform. As
it happens though, zero-crossings or up-crossings are easy to
see and extract from the waveform. 2) The second motivation
for this work comes from neurophysiological research on the
auditory information processing of the ear itself. Basically
the ear processes an incoming signal in at least two widely
recognized manners. The first is analysis in the
frequency-domain and is analogous to a kind of filter bank where different
neurons along the basilar membrane respond to different
frequency ranges; that is, a given neuron fires if it detects a
signal of sufficient intensity within a particular frequency
range. Neurons also code information in the time-domain in a
manner known as phase-locking (4,8). Given a signal
waveform, a phase-locking neuron responds by firing once,
phase consistently, for each cycle or integer number of cycles
within the waveform. The technique we are using is directly
analogous to this latter time-domain coding technique.

We generate our visual displays as follows: A zero-axis
is drawn horizontally through the center of the acoustic
waveform. We note the exact time when the waveform
crosses this axis in an upward direction. In actuality, we
usually record only those up-crossings which exceed some
threshold amplitude, epsilon, set slightly above the horizontal
zero-axis. This threshold tends to preclude low amplitude
background noise. We measure each interval between
successive up-crossings and plot these as a function of time in
our displays. Therefore each up-crossing in the acoustic
waveform is represented by a discrete dot in our displays. In
fact, we actually plot, on a log scale, the inverse of the
interval between successive up-crossings (the period of the
cycle) along the vertical y-axis and time along the horizontal
x-axis. This yields a display which superficially resembles a kind
of spectrographic display. (N.B. For those readers familiar with
neurophysiological studies of single unit responses, this display
is analogous to an "instantaneous frequency" plot and
functionally analogous to a phase-locking phenomenon.) We
also display a rough intensity measure by means of a z-axis
modulation. That is, the size of a dot representing a given
cycle is proportionate to the log of the greatest amplitude
achieved during that cycle. This dot size intensity measure in
our up-crossing displays is analogous to the intensity measure
expressed in spectrograms. The following illustration shows
the relationship of the log inverse period plot to the waveform
from which it is generated. Note that individual cycle-frequency
values may be easily read from the y-axis.
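
The construction of a LIP plot described above can be approximated in a few lines of Python. This is a minimal sketch under assumptions, not the authors' program. The function names are hypothetical, the threshold test is read as "the waveform rises through the level epsilon", the crossing time is placed by linear interpolation between the two neighbouring samples, and base-10 logarithms are an arbitrary choice.

```python
import math


def upcrossings(samples, rate_hz, epsilon=0.0):
    """Find upward crossings of the level epsilon.

    Returns (time_in_seconds, index) pairs, where index is the first sample
    at or above epsilon and the time is linearly interpolated.
    """
    found = []
    for i in range(1, len(samples)):
        a, b = samples[i - 1], samples[i]
        if a < epsilon <= b:
            frac = (epsilon - a) / (b - a)
            found.append(((i - 1 + frac) / rate_hz, i))
    return found


def lip_points(samples, rate_hz, epsilon=0.0):
    """One dot per cycle: (time, log10 of inverse period, log10 of peak amplitude)."""
    ups = upcrossings(samples, rate_hz, epsilon)
    points = []
    for (t0, i0), (t1, i1) in zip(ups, ups[1:]):
        peak = max(abs(s) for s in samples[i0:i1])  # greatest amplitude in the cycle
        points.append((t0, math.log10(1.0 / (t1 - t0)), math.log10(peak)))
    return points
```

For a steady 100 Hz tone every dot falls at a y value of 2, that is, at 100 on the logarithmic cycle-frequency axis.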


The idea of looking at zero-crossing measures per se is
not in itself conceptually new. However, in contrast to most
other investigators (2,3,7,9) who have used zero-crossing
measures to analyze speech, we do not average our up-crossings
over a fixed interval of time. Reasons for this will
be discussed shortly. First of all it is important to be aware
that the chief motivation for many zero-crossing studies has
been in searching for an inexpensive way to find frequency
domain acoustic features, such as formants. This method
avoids the computations required for Fourier transforms, for
example. In order to decrease the expense and variability in
examining individual cycles, it was easy to compute an
average cycle length by simply counting the number of
zero-crossings occurring during a given time interval. This
procedure has two major consequences: 1) the perfect time
resolution inherent in the time-domain is lost when crossings
are averaged; that is, a bandwidth limitation is introduced, 2)
the conventional acoustic features extracted are usually less
precise and more variable than the same acoustic features
extracted directly with a frequency-domain analysis. Our
reason for not averaging up-crossings is that in the speech
waveform itself there are significant acoustic features which
last for only one or a few cycles in duration. If cycles are
averaged, this information is irrevocably lost. Such transient
events frequently occur at vowel-consonant and
consonant-vowel boundaries as well as between other acoustically
distinct regions, within stop consonants for example.
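
The loss of information caused by averaging can be seen in a small numerical illustration. The Python sketch below uses hypothetical function names and invented crossing times. It counts up-crossings (one per cycle) rather than all zero-crossings, which is an assumption made for simplicity.

```python
def cycle_frequencies(upcrossing_times):
    """Inverse period of every individual cycle, in Hz."""
    return [1.0 / (t1 - t0) for t0, t1 in zip(upcrossing_times, upcrossing_times[1:])]


def windowed_average_frequency(upcrossing_times, start, length):
    """Crossings counted in a fixed window, divided by the window length."""
    count = sum(1 for t in upcrossing_times if start <= t < start + length)
    return count / length


# Invented data: steady 10 msec cycles with one 0.5 msec cycle in the middle.
times = [0.000, 0.010, 0.020, 0.0205, 0.030, 0.040]
per_cycle = cycle_frequencies(times)
averaged = windowed_average_frequency(times, 0.0, 0.05)
```

Here the per-cycle values contain a single cycle at about 2000 Hz, while the 50 msec windowed average is about 120 Hz, so the brief event is no longer visible in the averaged measure.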


The total amount of data examined during the course of
this investigation consisted of several thousand utterances, in
both citation form as well as connected speech, spoken by
more than 20 male and female speakers, often in noisy
environments. The set of this data which has been studied
most thoroughly consists of 684 utterances in citation form,
generously provided by June Shoup. Each of the speakers (2
male and 1 female) spoke 228 utterances designed to
provide examples of all the allophones of the fricatives and
stop consonants common in the English language, as described
in June Shoup's Ph.D. thesis, 1964 (10).

Each of these three sets of recordings was digitally
sampled at 20 kHz. Then a number of time-domain measures
were computed from these digital files. The accuracy of such
measurements is of course limited by the 50 microsecond
resolution of the sampling. However linear interpolation
between two successive samples was routinely performed to
more accurately pinpoint the time of waveform up-crossings.
The time for each waveform up-crossing was computed and
used to calculate the inverse period for each cycle in the
waveform. Various amplitude measures were computed for
each cycle as well as several measures of the amount of
microstructure riding on each cycle. Each of these three types
of time-domain parameters has proved to be quite useful.
Then with all of these parameters available, a cycle-by-cycle
hand analysis of the waveforms for all 684 utterances was
performed in order to precisely mark the time at which sharp
discontinuities in one or more of these parameters delineated
the acoustically distinct segments which occur internally in
fricative and stop consonants. This precise segmentation
required correlation of the time-domain parameter values with
the LIP plots and expanded waveforms. Statistics on each of
these acoustic segments were then computed with respect to
each of 18 linear and logarithmic time-domain parameters. In
all there were 23 different statistical tests performed on the
individual acoustic segments for each fricative and stop
consonant. These tests included finding the number of cycles
in the sample, the mean, maximum and minimum values,
standard deviation, bimodal distribution etc. In addition,
where values of individual cycles within a given segment were
more than 2 standard deviations from the mean for the whole
set, these cycles were eliminated and statistical measures, as
described above, were computed for the remaining set of
cycles. Also, for each segment a least squares linear fit was
computed and its values at the beginning and end of the
segment, respectively, were derived. These latter measures
are particularly useful for indicating whether a given segment
is relatively steady state and how great a discontinuity occurs
at the end of one segment compared with the beginning of the
next.
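
Two of the per-segment measures just described, the statistics recomputed after discarding cycles more than 2 standard deviations from the mean and the end points of the least squares line, can be sketched in Python as follows. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which form was used.

```python
import statistics


def trimmed_stats(values, k=2.0):
    """Mean and standard deviation after dropping values more than k
    standard deviations from the mean of the whole set."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population form, an assumption
    kept = [v for v in values if abs(v - mean) <= k * std]
    return statistics.mean(kept), statistics.pstdev(kept), kept


def linear_fit_endpoints(times, values):
    """Fit a least squares line to (time, value) pairs and return its value
    at the first and at the last time of the segment."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return intercept + slope * times[0], intercept + slope * times[-1]
```

Comparing the fitted end value of one segment with the fitted start value of the next gives a simple number for the size of the discontinuity between them.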

Fricatives 

Generally fricatives are acoustically characterized as
sustained high frequency regions. In voiced fricatives, this
high frequency region is preceded by a low frequency region
which may persist throughout the high frequency region as
well. Time domain analysis reveals that at the beginning of
the high frequency portion of the fricative, there is a very
sharp discontinuity simultaneously, upward, for both cycle
frequency and microstructure and often a decrease in
amplitude where the fricative is preceded by a vowel. These
are all large changes which are usually sustained for the
duration of the fricative. Usually at the end of the fricative,
sharp discontinuities are again observed. However a much
more transient kind of acoustic feature often occurs at the
very beginning and again at the end of the fricative. At these
places is found one or a few cycles characterized by lower
cycle-frequencies than those of the other cycles in the
acoustic segment immediately preceding and the acoustic
segment immediately following this transitional phenomenon.
Amplitude of these cycles is variable and cycle microstructure
is usually low. These transition cycles are marked "t" in the
LIP plots. Regions of frication are marked "f" and for voiced
fricatives, the initial lower frequency region is marked "v".
Each line of waveform represents .1 sec of the speech signal
analyzed. Similarly, the x-axis of the LIP plots is marked at
.1 sec intervals. The first example is an /s/ from the utterance
"there sir" (female speaker, HN).




The second example shows the voiced fricative /v/ in
the utterance "invent" (male speaker, EH).

Acoustically, stop consonants usually have a pause
portion followed by a higher frequency region which
represents the stop consonant release region, plus aspiration
if present. A voiced stop consonant has a low frequency or
voicing region just preceding the pause portion. Often these
lower frequencies are sustained throughout the
release-aspiration region as well. And it is not uncommon for the
pause cycles to be completely omitted in a voiced stop
consonant.

As the waveform transitions from prior context or the
initial voicing region in voiced stop consonants, the
cycle-frequency, amplitude and microstructure drop sharply.
Although this pause portion lasts only one or a few cycles, the
cycle-frequencies are quite low, often less than 100 Hz. The
dots representing these low cycles are visually quite obvious
in the LIP plots. Next, as the waveform transitions abruptly
into the release-aspiration region, both cycle-frequency and
microstructure measures increase sharply as does amplitude,
which nonetheless at its peak value generally remains well
below the average level for unstressed vowels. Where
aspiration is present, the transition from release to aspiration
is often smooth with cycle-frequencies and amplitude gradually
decreasing.

In the LIP plots shown here, pause cycles are marked
"p", the release-aspiration region by "r", and the initial voiced
region of voiced stop consonants by "v". The following
example is of the /t/ in the utterance "the till" (male speaker,
EH).


Time-domain analysis also reveals the existence of
several more subtle acoustic phenomena which have previously
gone unrecognized. These phenomena are often both short in
duration and low in amplitude. They occur often at phone
boundaries and last for only one or a few cycles in the
acoustic waveform.

The first of these is analogous to the transitional cycles
previously described for fricatives. At the end of the release-
aspiration region of the stop consonant, there is often, though
not always, one or a few cycles which have lower cycle-
frequencies than any of the other cycles in either of the
acoustic segments immediately preceding and following this
acoustic event. These transitional cycles are marked as "t" in
the  LIP  plots  which  follow. 

The  second  phenomenon  is  very  common  and  shall  be 
referred  to  as  a "stop  preview".  In  the  case  of  a stop 
consonant  which  is  preceded  by  a vowel  (and  sometimes  by 
other  phone  types  as  well),  the  very  end  of  the  vowel  is 
acoustically  characterized  by  one  or  two  cycles  with  much 
higher  cycle-frequencies  than  any  of  the  other  cycles  which 
comprise the vowel. These stop preview cycles are very low
in amplitude. Their duration is almost always less than 1 msec
and very commonly less than 0.5 msec. In the LIP plots, these
are  marked  as  "sp". 

The  third  phenomenon  concerns  the  one  or  two  cycles 
immediately preceding the stop preview. These one or two
cycles are usually of relatively large amplitude, but have a
lower cycle-frequency than the cycles preceding them.
Only at the very beginning of the vowel are there cycles with
cycle-frequencies as low or lower than the cycles immediately
preceding the stop preview. Although these stop preview
transitional cycles are sometimes omitted when the stop
preview is present, they have not been observed when the
stop preview itself is absent. They are marked as "spt" in the
LIP plots.

Illustrative  examples  of  all  these  phenomena  are 
provided  in  the  utterances  1)  "to  do"  and  2)  “he  grows” 
(female speaker, HN).

Given that neurons in the ear, as those in the other
sensory modalities, often respond most vigorously to sharp
discontinuities of the incoming signal, it is intriguing to
speculate on the information provided by this common stop
preview phenomenon and when it may be most useful. Its
most obvious aspect is the cue it provides that a stop
consonant follows. It is conceivable that, especially in
connected speech where stop consonants are often very brief,
such redundancy of their presence may be helpful for stop
consonant detection.

Allophones  and  Acoustic  Correlates 

Using time-domain analysis, it is easy to compute, for
example, characteristics of a /p/ release and compare these to
those of a /t/ release. Certain general attributes become
readily apparent. For example, the cycle-frequencies of the
/p/ release are much more diffuse than those of the more
concentrated /k/ release. A /t/ release, in comparison to both
of these, usually has more energy concentrated at much higher
cycle-frequencies. Given the same context, these attributes
and other time-domain parameters are quite useful for
consistently distinguishing between /p/, /t/, and /k/. However,
the acoustic correlates of the release of a particular stop
consonant in a given environment are often quite changed
when this same phone occurs in a different context.
Coarticulation  effects  thereby  give  rise  to  many  allophones. 

In the following examples are shown two allophones of
the phone /k/, one rounded and one not. The utterances
containing these are, respectively, 1) "pawn to queen four" and
2) "pawn to king four" (male speaker, JB). The release
portion of the rounded /k/ of "queen" is characterized by
much lower cycle-frequencies than the release portion of the
/k/ in "king". Rounding of the lips causes the vocal tract to be
lengthened, thereby lowering the cycle-frequencies emitted.

These examples demonstrate the importance of
understanding coarticulation effects in the task of recognizing
individual phones from acoustic information.

N.B.  Readers  interested  in  the  specific  acoustic 
correlates  to  the  allophones  of  fricatives  and  stop  consonants 
are  referred  to  the  author's  Ph.D.  thesis,  1974(1). 


Acoustic-Phonological Phenomena

Another kind of event commonly occurs during the
release portion of stop consonants. In the acoustic waveform,
this portion is characterized by amplitude pulsing. The cycle-
frequency composition of each of these pulses resembles that
of the normal release portion of the same stop consonant
when such amplitude pulsing is not present. Where aspiration
occurs, it follows this amplitude pulsing, as it would a normal
release. The following examples of both waveforms and LIP
plots show such amplitude-pulsed /k/s in the utterances 1)
"soak me" and 2) "soak to" (male speaker, JA).


There  are  a variety  of  acoustic  phonological 
phenomena  which  are  commonly  observed  with  time-domain 
analysis. Generally these phenomena are readily apparent in
both the time-domain waveform and log inverse period plots.
However, especially when such acoustic events are either very
low in amplitude or very brief in duration, or both, their
existence is much more visually evident in the log inverse
period  plots. 

One  very  common  phenomenon  is  the  case  where  a 
fricative  is  characterized  by  a central  region  where  the  cycle- 
frequencies  are  lowered  in  relation  to  that  phone’s 
characteristic frication frequencies. In the following example,
the phone of interest is a rounded /f/ in the utterance "no
foe" (female speaker, HN). The central mean frequency for the
initial fricated region is 1079 Hz, for the central region is
693 Hz, and for the final fricated region is 1148 Hz. In
addition, the first fricated region is much greater in amplitude
than the central and final regions, which are about equal in
amplitude.

The next phenomenon regards the issue of the acoustic
correlates of what are commonly referred to as "unreleased"
stop consonants. Time-domain analysis reveals that often the
stop consonants which are phonetically transcribed by linguists
as "unreleased" or omitted are acoustically characterized by
the usual pause cycle(s), but with a very brief segment of high
frequency energy which is analogous to a normal release
segment, and which is sometimes followed by the transition
cycle(s) leading into the next phone's acoustic events. This
very brief segment consists of only a few cycles, often just
one or two cycles, where the entire duration of this portion
may be so short as to be less than 1 msec, and rarely more
than 6 msec long. The temporal sequence of acoustic events
characterizing these unreleased stop consonants is usually
identical with that for released stop consonants except for
durational aspects. The few cycles with high cycle-
frequencies remaining in unreleased stop consonants are
unreliable indicators for specific identification of the stop
consonant. However, the information that a stop consonant has
occurred and whether or not it was voiced does remain in most
cases. The following example shows such an unreleased stop
consonant, the /b/ in the utterance "tub took" (male speaker,
FH). In this particular example, the segment with high cycle-
frequencies is relatively long in duration, 4.2 msec, and is
composed of 7 cycles preceded by normal voicing and pause
cycles.


Another example follows where the same kind of
acoustic event occurs during the course of 4 cycles lasting a
total of 2.3 msec. It occurs for the /k/ immediately preceding
the /t/ in the word "spectrogram" (female speaker, SM). Such
short duration acoustic events are quite common in connected
speech.

In both of these instances, an unreleased stop consonant is
followed by another stop consonant which is released. Acoustic
observations of a very brief stop consonant often indicate that
another stop consonant will immediately follow.

Another very common phonological phenomenon
relates to the insertion of an extra stop consonant. This occurs
when a syllable ends with a stop consonant and the next
syllable begins with a vowel (or sometimes a liquid), even
when there is a word boundary separating the two syllables.
The speaker often articulates a final stop consonant at the
end of the first syllable as expected, but then repeats this
same stop consonant when he begins the next syllable. When
this happens it is not obvious to a human listener that a
second stop consonant has been inserted by the speaker. In
the following example of the utterance "about Israel" (male
speaker, JB), the /t/ in "about" is repeated, even after a long
interword pause of 17 sec, at the beginning of the initial
vowel in the word "Israel". Acoustically both /t/s are
complete in all respects.
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SUB-LEXICAL  LEVELS  IN  THE  HEARSAY  II  SPEECH  UNDERSTANDING  SYSTEM 
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ABSTRACT 

The HEARSAY II system provides a uniform multi-level structure for representing the partial
analysis of the utterance as it is being recognized and a convenient modular structure for
incorporating new knowledge (i.e., processing capabilities) into the system at any level. This
paper describes the sub-lexical levels chosen for the initial configuration (parametric, segmental,
phonetic, surface-phonemic, syllabic) of the system and the kind of processing that is
accomplished at those levels. The choice of levels is related to traditional phonological theories.


INTRODUCTION 

The  HEARSAY  II  (HSII)  speech  understanding  system  (whose 
system organization is described more completely in Lesser, et al.,
1974) provides a unified structure for describing an utterance as it
is being analyzed. This structure may be thought of as 3-
dimensional, with the dimensions being level of representation
(e.g., acoustic, phonetic, lexical, syntactic), time, and alternative
possibilities. This structure is held as a single data base which the
system maintains. HSII also provides a means for introducing
knowledge sources (realized as computer programs) to work
towards recognition; the knowledge sources (KS's) cooperate by
examining and modifying this global data structure in a generalized
form of hypothesize-and-test.

Earlier  speech  recognition  systems  have  suffered  from 
problems  with  internal  levels  of  representation;  in  general,  they 
have no clear distinction among such concepts as "acoustic",
"phonetic", "phonological", "phonemic", etc.* The major difficulties
caused by this fuzziness of representation are the inability to
decompose the system so as to allow useful performance analysis
of the various sources of knowledge and the inability to make use
of results obtained by linguists and phonologists working along
traditional lines. The HSII system, on the other hand, does not
pre-specify the set of levels used in the data structure nor the
set of knowledge sources; a particular system configuration is
generated by defining the levels to be used and creating the
knowledge sources to operate over them. Because the levels of
representation  are  uniform  and  must  be  explicitly  defined  well 
enough  for  the  KS’s  to  interact  through  them  in  an  independent 
manner,  there  is  much  more  need  and  motivation  to  choose  and 
delineate  them  in  a less  ad  hoc  manner  than  in  previous  systems. 

This  paper  1)  describes  the  choice  of  the  sub-lexical  levels 
in the initial configuration (called HSII-C0) which is being
implemented as the first test of HSII, 2) gives some feel for the
kinds of processing occurring at and between those levels, and 3)
relates  those  levels  to  traditional  phonological  theory. 


THE  LEVELS 

The HSII-C0 configuration has five levels "below" the lexical
level:

(6. Lexical)
5. Syllabic
4. Surface-Phonemic
3. Phonetic
2. Segmental
1. Parametric

At  each  level,  a (potentially  complete)  representation  of  the 
utterance is formed, composed of units appropriate to the level.

At the parametric level, the speech is represented by vectors
of parameters (e.g., spectral parameters), typically sampled,
for example, every ten milliseconds.

At the segmental level, the utterance is described as being
composed of labeled acoustic segments. Each segment
represents an acoustically homogeneous section of speech
(or a transitional segment) and is labeled in a way that
describes its acoustic characteristics.

At the phonetic level, the utterance is represented by a
phonetic description. This is a broad phonetic description
in the sense that some acoustically dissimilar elements are
grouped into perceptual units (e.g., silence + burst +
aspiration may be represented by a single plosive
symbol); it is a fine phonetic description in the sense that it
is possible to specify articulatory modifications
(retroflexion, nasalization) and degree of stress.

The surface-phonemic level represents the utterance in units
which can be thought of as phoneme-sized, with the
addition of modifiers such as stress and boundary (word,
morpheme, syllable) markings.

The syllabic level represents an utterance as being composed
of syllables.


* Two prime, but by no means exclusive, examples of this problem
are the direct ancestors of HEARSAY II: the Vicens-Reddy
system (Vicens, 1969) and HEARSAY I (Reddy, et al., 1973a,
1973b).


At each level, there is an identical connection structure which
allows the representation of sequences and (competing)
alternatives. In addition, structural connections are also made
across levels, relating how the elements at one level serve to
support hypotheses at other levels.
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PROCESSING 

A knowledge source operates by reacting to a (sub-)
structure built in the global data base by another KS; it adds new
elements at some level or adds new connections between existing
units. This operation of a knowledge source is triggered directly
by the change to the structure, not by the other KS. Thus, a KS is
not aware of other knowledge sources, but rather specifies the
kinds of sub-structure and changes to which it desires to react.

At the sub-lexical levels, the general paradigm can be thought
of as a rewriting scheme: a KS notices some structure and
rewrites it as a different structure. In addition, it links the initial
structure to its newly created one. Finally, if the new elements it
is  attempting  to  construct  already  exist  (either  previously  created 
by  itself  or  some  other  KS),  then  the  structure  is  not  duplicated; 
rather,  new  connections  are  made  to  the  pre-existing  structure. 
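
The non-duplication rule just described can be sketched in a few lines of Python. This is only an illustration, not the HSII implementation: the names `Blackboard`, `Hypothesis`, and `hypothesize`, and the choice of (level, label, start, end) as the identity of a unit, are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """One unit in the global data base (all field names are hypothetical)."""
    level: str
    label: str
    start: int   # begin time, in arbitrary frame units
    end: int     # end time, in the same units
    supports: list = field(default_factory=list)  # lower-level units linked as support


class Blackboard:
    """A toy global data structure shared by all knowledge sources."""

    def __init__(self):
        self._units = {}

    def hypothesize(self, level, label, start, end, support):
        """Create a unit, or reuse the existing one and only add new links."""
        key = (level, label, start, end)
        unit = self._units.get(key)
        if unit is None:
            # No such structure yet: create it.
            unit = Hypothesis(level, label, start, end)
            self._units[key] = unit
        # In either case, link the supporting structure (without duplicates).
        for s in support:
            if not any(s is existing for existing in unit.supports):
                unit.supports.append(s)
        return unit

    def count(self, level):
        return sum(1 for u in self._units.values() if u.level == level)
```

Two knowledge sources that propose the same phone over the same time span therefore end up sharing a single unit whose support links accumulate, rather than creating two competing copies.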

For  simplicity  of  exposition,  the  following  description  of  these 
levels  and  processes  assumes  a bottom-up  approach  and  linkages 
only  between  adjacent  levels,  but  we  will  see  below  that  these 
limitations are not in the system.

From the parametric level to the segmental level, the main
action is to group acoustically similar samples and then label the
segments. The segmentation scheme currently used in HSII-C0
(Goldberg, et al., 1974) is parameter independent. At present, the
parametric values for the segment target labels are determined
from a corpus of continuous speech by one male talker, which has
been  hand  segmented  and  labeled  with  a fairly  narrow  phonetic 
transcription  (using  on  the  order  of  75  labels).  Each  segment 
receives  up  to  five  different  labels,  each  with  a confidence  rating. 

Although  the  segment  labels  used  are  often  also  phonetic 
symbols,  the  level  is  not  intended  to  be  phonetic  --  the 
segmentation  and  labeling  reflect  acoustic  characteristics  and  do 
not,  for  example,  attempt  to  compensate  for  the  context  of  the 
segments or attempt to combine acoustically dissimilar segments
into (phonetic) units. It is clearly necessary to improve on the
method of target selection to accommodate speaker variation.
Obviously, these targets can be established for any language,
although we have dealt exclusively with English.

The segment labels are actually defined through a set of
features: each segment label is defined as having a ternary value
(+, -, or 0) for each feature. Other than being ternary, as opposed
to binary, these features bear some resemblance to the
Jakobson-Fant-Halle (1951) feature set. The use of features creates
an indirectness of reference which isolates the processing
algorithms (knowledge sources) from any particularly chosen set of
segment labels; thus different parametric representations may use
different sets of labels (i.e., they need only be defined in terms
of the feature vectors). This feature representation is also a means
of creating an algebra for manipulating the segment labels: for
example, the five alternative labels assigned to each segment may be
combined by combining their feature vector definitions (assuming a
value of -1 for -, +1 for +, and 0 for 0 and using the confidence
measures as weightings). The values of individual features of such a
combined vector may be used directly (e.g., to determine if a
segment is "voiced" or "nasalized") or the entire vector may be
used to derive a new label, which will tend to be an "average"
over the input labels.
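
The label algebra described above can be illustrated with a short Python sketch. The feature names, the feature values given to each label, and the choices to normalize by the total confidence and to pick the "average" label by squared distance are all assumptions made for this example; the text only states that ternary values are mapped to -1, 0, +1 and weighted by confidence.

```python
# Hypothetical feature inventory and label definitions (+1, 0, -1 per feature).
FEATURES = ("voiced", "nasalized", "fricated", "vocalic")

LABEL_FEATURES = {
    "IY":  (+1, -1, -1, +1),
    "IYn": (+1, +1, -1, +1),
    "N":   (+1, +1, -1, -1),
    "S":   (-1, -1, +1, -1),
}


def combine(labels):
    """Confidence-weighted mean of the feature vectors of alternative labels.

    `labels` is a list of (label, confidence) pairs for one segment.
    """
    total = sum(conf for _, conf in labels)
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")
    vec = [0.0] * len(FEATURES)
    for label, conf in labels:
        for i, value in enumerate(LABEL_FEATURES[label]):
            vec[i] += conf * value
    return [x / total for x in vec]


def nearest_label(vec):
    """Derive a single 'average' label: the one closest to the combined vector."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(LABEL_FEATURES[label], vec))
    return min(LABEL_FEATURES, key=squared_distance)


# Alternatives assigned to one segment, each with a confidence rating.
alternatives = [("IYn", 0.6), ("IY", 0.3), ("N", 0.1)]
combined = combine(alternatives)
is_nasalized = combined[FEATURES.index("nasalized")] > 0
```

Here an individual feature of the combined vector (its sign for "nasalized") is read off directly, while `nearest_label(combined)` shows the second use, deriving one label from the whole vector.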

Going  to  the  phonetic  level,  the  main  activity  is  hypothesizing 
phones  from  the  labeled  acoustic  segments,  using  the  adjacent 
acoustic  segments  and  previously  recognized  phones  as  context. 
This hypothesization may take several forms:

1) A single segment may be propagated as a single phone,
with, perhaps, some relabeling. In this case the phone is
one-to-one with its acoustic segment. Moreover, patterned
errors caused by allophonic overlap are dealt with here.
For example, a rule at this level could say, "nasalized [OW]
might be velarized [AX] if it is found before [L]."

2) One phone may be synthesized from several similar
adjacent segments. This form of combining can be thought
of as a way of correcting errors of the segmenter that
require contextual information.

3) A phone may be synthesized from several dissimilar adjacent
segments. For example, a stop may be generated from a
silence followed by a segment of noise.

4) A phone may be generated from within one or across two
(or more) segments. For example, the sequence of [IYn T]*
may become [IY N T] at the phonetic level, expressing the
idea that the phone N may be acoustically detectable only
as a nasalization of the preceding vowel. However,
[IYn N] would be rewritten as just [IY N], since the
nasalization is predictable from the environment.

5) Phones may be generated using combinations of the above.
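
The segment-to-phone rewriting forms listed above can be sketched as a small rule table in Python. The rule format, the single left-to-right pass, and the symbols `SIL`, `NOISE`, and `STOP` are assumptions for illustration; the actual knowledge sources keep competing alternatives with confidence ratings and links to the supporting segments, which this sketch omits.

```python
# Each rule maps a sequence of segment labels to a sequence of phones.
# Longer, more specific patterns are listed before shorter ones.
RULES = [
    (("IYn", "N"), ("IY", "N")),        # nasalization predictable from the environment
    (("IYn", "T"), ("IY", "N", "T")),   # N detectable only as nasalization of the vowel
    (("SIL", "NOISE"), ("STOP",)),      # dissimilar segments grouped into one plosive
]


def rewrite(segments):
    """Rewrite a list of segment labels as a list of phones."""
    phones = []
    i = 0
    while i < len(segments):
        for pattern, replacement in RULES:
            if tuple(segments[i:i + len(pattern)]) == pattern:
                phones.extend(replacement)
                i += len(pattern)
                break
        else:
            # No rule applies: propagate the segment as a single phone.
            phones.append(segments[i])
            i += 1
    return phones
```

For instance, `rewrite(["M", "IYn", "T"])` yields `["M", "IY", "N", "T"]`, while `rewrite(["IYn", "N"])` yields `["IY", "N"]`.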

The broad phonetic transcription at the surface-phonemic
level is linked to the dictionary pronunciations (from the
lexical level). This association process uses phonological rules
which rewrite symbols at the surface-phonemic level. For every
phonetic element assumed to be present, a determination is made
as to what underlying (phonemic) element or sequence of elements
could have generated it. For example, an utterance-final [N] could
have been derived from any of [N], [NX], [N T], or [N D]. Each
possibility  which  is  generated  is  given  a confidence  rating 
depending  upon  how  strong  the  initial  identification  is  and  upon 
what  support  is  derived  from  environmental  evidence.  Matches 
are  then  made  of  temporal  and  segmental  properties  between 
surface-phonemic  and  lexical  items.  Of  course,  at  this  point  there 
is  strong  interaction  with  the  syntactic  and  semantic  components 
of  the  system  working  from  the  higher  levels. 

The  syllabic  level  is  one  which  receives  only  cursory 
attention in the present implementation of the system. We hope to
use  it  in  the  future  as  a repository  of  prosodic  information.  Also, 
this  looks  like  a very  promising  level  for  doing  effective  lexical 
retrieval  in  terms  of  the  size  of  the  syllable  unit  in  relation  to  the 
size  of  words. 

It  should  be  understood  that  although  this  structure  has  been 
presented  as  a strict  sequence  of  bottom-up  processing  through 
adjacent  levels,  there  are  no  such  restrictions  in  the  system;  in 
fact, much of the action comes from top-down processing and level
skipping. For example, if a word is hypothesized at the lexical
level which has elements different than those generated from
below, it is possible to probe down through the levels, hunting
harder for evidence that substantiates the word hypothesis. As an
example of level-skipping, given that an hypothesizer at the
syntactic or pragmatic level has suggested that a sentence may be
a 'yes-no question', an immediate skip may be made to the
parametric level to investigate pitch contours. These various kinds
of  actions  of  top-down,  bottom-up,  and  level-skipping  can  all  be 
happening  simultaneously,  as  the  knowledge  sources  are  executed 
asynchronously  and  in  parallel. 


* We use the notation [IYn] here to mean "nasalized [IY]".


LEVELS AND PHONOLOGICAL THEORY


In the past several decades, two major phonological theories
have achieved prominence. The phonemic theory (Gleason, 1961,
Hockett, 1955, Harris, 1951) is based on the tenet that there are
discrete levels of analysis on the morphophonemic, phonemic,
allophonic, and phonetic levels and that each of these levels can
be mapped onto its neighboring levels by the use of a set of
distributional statements plus a statement for each segment
regarding its free variation properties. In this theory, the phone
is the surface-level entity: that which is actually articulated. The
other levels are perceptual or statistical constructs.

This  theory  has  been  attractive  to  builders  of  speech 
recognition  systems  for  two  unrelated  reasons:  1)  its  separate 
levels are relatively easy to deal with in a computer system and 2)
several  influential  people  in  speech  recognition  have  been  trained 
in  and/or  have  contributed  substantially  to  phonemic  theory. 

The  second  general  class  of  theory,  which  has  enjoyed 
popularity more recently, is called generative phonology (Chomsky-
Halle, 1968, Postal, 1968a). In general, it assumes only two fixed
levels: some sort of underlying representation in abstract for
possible sequences of sounds in a given language (the nature of
which is much debated), and the phonetic output level (or
something very close to it). Connecting these two levels is a set
of phonological rules, frequently thought to be ordered, which
change properties of segments and possibly add or delete
segments. These rules can be optional or obligatory. They can be
compared to a series of filters: given a sequence of elements
destined to be articulated (including all types of boundaries), the
entire string is fed into the first filter. If it is able to modify an
element or group of elements in the input string, it does;
otherwise it lets the string pass unchanged. Of course, if the filter
is an optional one it may or may not be switched in. Then the
string is passed to the next filter. In general, alternations
between possible surface pronunciations of a given base form are
caused by an optional rule having applied or not applied. The
major point here is that there are very many output levels for
these filters, most of which can be inputs to others.
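
The filter analogy can be made concrete with a short Python sketch. The `Rule` class, the two example rules (vowel nasalization before [N], and optional deletion of a final [T] after [N]), and the use of `#` as a boundary symbol are hypothetical choices for illustration; they are not rules taken from the cited works.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Rule:
    """One filter: rewrites every occurrence of `target` as `output`."""
    name: str
    target: tuple
    output: tuple
    optional: bool = False

    def apply(self, symbols):
        out, i, n = [], 0, len(self.target)
        while i < len(symbols):
            if tuple(symbols[i:i + n]) == self.target:
                out.extend(self.output)
                i += n
            else:
                out.append(symbols[i])   # nothing to modify here: pass unchanged
                i += 1
        return tuple(out)


def surface_forms(base, rules):
    """All outputs of the ordered filters, for every on/off setting of optional rules."""
    optional = [r for r in rules if r.optional]
    forms = set()
    for switches in product([False, True], repeat=len(optional)):
        switched_in = dict(zip((r.name for r in optional), switches))
        string = tuple(base)
        for rule in rules:                      # rules are applied in order
            if not rule.optional or switched_in[rule.name]:
                string = rule.apply(string)
        forms.add(string)
    return forms


RULES = [
    Rule("nasalize", ("EY", "N"), ("EYn", "N")),                  # obligatory
    Rule("final-t-deletion", ("N", "T", "#"), ("N", "#"), True),  # optional
]

forms = surface_forms(("P", "EY", "N", "T", "#"), RULES)
```

With one optional rule there are two surface forms for the base string, one with the final [T] and one without, which mirrors the statement that alternations arise from an optional rule having applied or not.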

At  present,  researchers  trained  in  each  of  these  theories  are 
occupied in automatic speech recognition; some are trained in both.
It seems that a synthesis of the theories, or at least an agreement
as to terminology, would be desirable, since workers in ASR quite
frequently use the idea of distinct levels of analysis (phonetic,
allophonic, phonemic, etc.) but are also interested in using
phonological rules in a generative rather than a descriptive sense.
Perhaps attempts at building systems, such as HEARSAY II, which
explicitly span the full range of levels and make efforts at
conceptual cleanliness will prove an incentive and test-bed for
such a synthesis.

Due partially to this mixed theoretical framework, we
experience difficulty in finding reasonable terminology for at least
one of our levels. The parametric and segmental levels seem to be
largely extra-theoretical, having more to do with theories of
speech perception than directly with phonological theory. The
term 'phonetic level' seems well-motivated in that this level
attempts to postulate a phonetic transcription of the input or to
generate one from a hypothesized word. The syllabic level is
probably more related to acoustic-phonetic studies, though some
phonologists use the syllable boundary in rule writing. But the
level we call 'surface-phonemic' is not easily characterized in
terms of either of the theories mentioned above in most cases.
The hypotheses generated from below (typically the phonetic
level) represent a proposed phonemic transcription of just those
elements which are identifiable from the speech input; the
hypotheses generated from above (e.g., from lexical or syntactic
knowledge) include most of the possible alternative sequences of
allophonic tokens which can be related to the dictionary spelling,
but represented very broadly. This puts the surface-phonemic
level on a theoretically non-existent level somewhere between
allophonic and phonemic. In generative terms, the 'surface-
phonemic' level is more underlying than the output of the
Chomsky-Halle (1968) phonological component since it is a very
broad transcription. It is a form intermediate between underlying
and surface forms; but it is a level which we find useful despite its
lack of theoretical ancestry.

ACKNOWLEDGMENTS 

We  wish  to  thank  Raj  Reddy,  who  has  inspired  much  of  the 
formulation of these ideas, Victor Lesser and Richard Fennell for
their efforts in the design of HEARSAY II, which have spurred
much  of  our  efforts,  and  Gary  Goodman,  who  has  been  very 
constructively  reactive  to  our  ideas. 


BIBLIOGRAPHY 


Chomsky, N. and M. Halle (1968), The Sound Pattern of English,
Harper and Row, N.Y.

Gleason, H. A. (1961), An Introduction to Descriptive Linguistics, Holt,
Rinehart, and Winston, N.Y.

Goldberg, H. G., D. R. Reddy, and R. L. Suslick (1974), "Parameter-
Independent Machine Segmentation and Labeling," Proc. IEEE
Symp. Speech Recognition, Pittsburgh, Pa., pp. 106-111, (this
volume).

Harris, Z. (1951), Structural Linguistics, University of Chicago Press.

Hockett, C. F. (1955), A Manual of Phonology, IJAL Memoir 11,
Waverly Press, Baltimore.

Jakobson, R., G. Fant, and M. Halle (1951), Preliminaries to Speech
Analysis, MIT.

Postal, P. (1968a), Aspects of Phonological Theory, Harper & Row.

Reddy, D. Raj, Lee D. Erman, and Richard B. Neely (1973a), "A
Model and a System for Machine Recognition of Speech," IEEE
Trans. Audio and Electroacoustics, AU-21, 3, June, 1973,
pp. 229-238.

Reddy, D. R., L. D. Erman, R. D. Fennell, and R. B. Neely (1973b),
"The HEARSAY Speech Understanding System: An Example of
the Recognition Process," Proc. 3rd Inter. Joint Conf. on
Artificial Intel., Stanford, Ca., pp. 185-193.

Vicens, P. (1969), "Aspects of Speech Recognition by Computer,"
CS 127, AI-85, Comp. Sci. Dept., Stanford Univ. (Ph.D. thesis),
Stanford, Ca.


YY13  IEEE  Symp  Speech  Recognition 


* 


INFERENCE AND USE OF SIMPLE PREDICTIVE GRAMMARS


Elaine  Rich 

Carnegie-Mellon University*
Pittsburgh, Pa. 15213


One use of syntactic knowledge in a speech
understanding system is to focus the system on the most
probable paths as it is attempting to understand an utterance.
This function is frequently performed by a parser similar or
identical to the one used to generate a parse of the entire
utterance. However, it is possible to perform a large part of this
function without incurring the overhead of generating many
partial parses, most of which will eventually be thrown away.
This is done by using a simple probabilistic grammar which, given
a string of already recognized words, can predict the words
which can precede or follow the string and associate with each
such word the probability that it will occur. The system can then
consider the most likely possibilities first. If they are rejected
by the lower level knowledge sources, then the less likely
possibilities can be considered.


A knowledge source for the Hearsay II system
(Lesser, 1974) has been constructed to do this. The data used by
this syntactic knowledge source consist primarily of a collection
of sentence fragments of varying lengths, each of which has
associated with it a list of words which can precede it and a list
of words which can follow it, along with the probability that each
of those words will occur in that environment. These fragments
may contain both specific words and grammatical classes. The
fragments are arranged by the word immediately adjacent to the
word to be hypothesized. The program uses a lexicon and a
grammar which provide it with the information it needs. The
lexicon contains an entry for each word in the vocabulary which
specifies the grammatical category to which the word belongs.
The grammar specifies, for each grammatical category, the
fragments which begin and end with that category and the words
which can adjoin them (and the probability associated with each
word). To predict words at a given point in the utterance, the
knowledge source looks up the word of the partially recognized
utterance which is adjacent to the word to be predicted. Listed
for the part of speech to which it belongs will be strings of
arbitrary length starting with that word (for predicting to the
left) and ending with that word (for predicting to the right). The
program uses the longest such string which matches the
utterance fragment and predicts the alternatives listed as
occurring on the desired side of the fragment.
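
The longest-match lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the toy lexicon, the class names (DET, N, V), the fragments, and the probabilities are all hypothetical, and the sketch scans every stored fragment rather than indexing fragments by the adjacent word as the knowledge source is described as doing. Only prediction to the right is shown.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: word -> grammatical class.
LEXICON: Dict[str, str] = {
    "the": "DET", "a": "DET", "news": "N", "story": "N", "read": "V",
}

# Hypothetical toy grammar for predicting to the RIGHT. A fragment is a tuple
# of specific words and/or class names; it maps to (next word, probability).
RIGHT_GRAMMAR: Dict[Tuple[str, ...], List[Tuple[str, float]]] = {
    ("DET",): [("news", 0.6), ("story", 0.4)],
    ("read", "DET"): [("story", 0.7), ("news", 0.3)],
}


def matches(fragment: Tuple[str, ...], words: List[str]) -> bool:
    """True if the fragment matches the last len(fragment) recognized words."""
    if len(fragment) > len(words):
        return False
    tail = words[len(words) - len(fragment):]
    # Each fragment element may name the word itself or the word's class.
    return all(f == w or f == LEXICON.get(w) for f, w in zip(fragment, tail))


def predict_right(words: List[str]) -> List[Tuple[str, float]]:
    """Candidates from the longest matching fragment, most probable first."""
    candidates = [f for f in RIGHT_GRAMMAR if matches(f, words)]
    if not candidates:
        return []
    longest = max(candidates, key=len)
    return sorted(RIGHT_GRAMMAR[longest], key=lambda wp: -wp[1])
```

With these toy tables, the two-element fragment is preferred over the one-element fragment whenever both match, so the extra word of context changes the ranking of the predicted words.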

Since storing long strings to be used for prediction
incurs a great deal of overhead, both in terms of space and in
terms of the time required to check for a match between the
stored strings and a recognized piece of the utterance, it is
desirable to store long strings only if the use of additional words
causes a significant increase in the accuracy of prediction.
Experiments will be conducted to discover when increasing the
length of the strings ceases to cause such an increase in
performance.


* This research was supported in part by the Advanced Research
Projects Agency of the Department of Defense under contract
no. F44620-73-C-0074 and monitored by the Air Force Office
of Scientific Research.


April, 1974 (CMU)




The criteria for assigning words to grammatical classes
in this system are well defined and are not necessarily the same
as in the traditional grammatical system with nouns and verbs.
The first criterion is to maximize the amount of information
known about the environment of a word, given its grammatical
class. Thus, words which tend to occur in the same environment
should be in the same class. The second criterion is the
restriction of the number of classes in order to cut down on the
number of sentence fragments to be stored as well as the
number of possible alternatives adjoining each of those strings.
A program is being developed which will read a corpus of
utterances and construct grammatical categories from the words
of the corpus using the maximization of information criterion. As
with the question of how many words should be used to define
the environment, the question of how many grammatical classes
to use will be answered empirically by observing the point at
which the addition of more classes does not significantly improve
the predictive ability of the knowledge source. The principal
problem in getting the categorization program to do very well is
the need for a corpus large enough that each word occurs
enough times to reveal the environments it can occur in.


The program which constructs grammatical classes can
also construct, from the corpus, the lexicon and grammar needed
by the knowledge source. Thus it should eventually be possible
to have the machine both construct the grammar and use it.
One result of this is that it should be relatively easy to construct
a grammar based on a new corpus, thereby allowing the system to
recognize utterances pertaining to a new task.


Bibliography 


Lesser, V.R., R.D. Fennell, L.D. Erman, and D.R. Reddy, "Organization
of the Hearsay II Speech Understanding System", IEEE
Symp. Speech Rec., April, 1974 (this volume).




IEEE  Symp.  Speech  Recognition 




REAL-TIME LINEAR-PREDICTIVE CODING OF SPEECH
ON THE SPS-41 MICROPROGRAMMED TRIPLE-PROCESSOR SYSTEM


Michael  J.  Knudsen 
Carnegie-Mellon  University 
Pittsburgh,  Pennsylvania 


Summary 

Markel's autocorrelation method for linear predictive coding
of speech [1] has been implemented on the SPS-41, a
commercially available system composed of three dissimilar
microprocessors working in parallel. Using user-written
microcode, one processor performs I/O and master control, the
second handles loop indexing and counting, and the third does the
actual arithmetic on data. Such parallelism allows 2M I/O
operations and 4M multiplications per second, but actually
realizing this potential requires fresh approaches to some old
algorithms, e.g., a new autocorrelation scheme with several
valuable properties. Inverting the autocorrelation matrix in 16
bits of fixed point also poses problems. The present program
converts 256 words of 13-bit samples into 14 coefficients at 100
frames per second.


Review of Markel's Method


Motivation 


Our major interest in Markel's method at C-MU is to find the
resonance spectrum of the vocal tract for each frame of speech,
where a frame is about 200-300 samples for a 10 kHz sample
rate. The next step after this (not covered here) is to identify
the formants or otherwise compare the resonance curve with a
standard set of corresponding data for various phonemes, in order
to identify the phoneme spoken.

A straightforward high resolution spectrum of the frame (as
by an FFT) will not do, as it will contain not only the frequency
response of the vocal tract but also, superimposed on it, the
spectrum of the excitation source. This will be either a dense
series of sharp peaks and valleys from the glottal pulses in voiced
speech, or a random jagged curve from the white noise in
unvoiced speech. In either case the many extraneous peaks and
valleys mask out the desired formant peaks.

Overview 

Markel's  method  is  a form  of  deconvolution,  or  separating  the 
effect  of  the  driving  function  (unwanted  in  our  case)  from  that  of 
the  driven  system  (the  desired  vocal  tract  response).  Thus  the 
smooth  resonance  spectrum  of  the  vocal  tract  can  be  obtained. 
(The excitation signal can also be identified and used for pitch
extraction.)

Markel derives an inverse filter for each frame of speech
signal. Such a filter attempts to destroy the signal input, i.e.,
reduce it to minimum energy and information content, either white
noise or zero. The frequency response of this filter must be the
inverse of the spectrum of the signal for which it was designed.
However, by judicious selection of its length, the filter can be
made capable of wiping out the gross frequency characteristics of
the signal (which correspond to the formant resonances), but
unable to follow the fine detail of the input spectrum (due to the
excitation source). Thus the filter's frequency response is
inverse to the desired vocal tract response, but not to the
undesired excitation. Since we generally work with logarithmic
(dB) frequency response scales, and log(1/x) = -log(x), we need
only reverse the sign of the inverse filter response (as by viewing
it upside down) to plot the frequency response of the vocal tract.



Nature of Inverse Filter. The filter is finite-impulse-response,
all-zeroes, and implemented in direct feed-forward form with unit
delays. Such a filter of length M is represented as:

$$A(z) = 1 + \sum_{i=1}^{M} a[i]\, z^{-i}$$


Markel's algorithm designs the filter for each frame by specifying
the M values a[1], a[2], ..., a[M].

The frequency response of any filter is defined as the
spectrum of the output resulting from a single unit impulse input.
However, the filter defined above will respond to a unit impulse
simply by outputting a 1 and then reading out its coefficients in
order, followed by zeroes forever. Therefore a discrete Fourier
transform (DFT) applied directly to the series

    1, a[1], ..., a[M], 0, 0, 0, ...

followed by magnitude, logarithm, and negation will compute the
vocal tract resonance spectrum.
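
The chain of DFT, magnitude, logarithm, and negation can be sketched in a few lines. This is an illustrative floating-point sketch, not the SPS-41 fixed-point program; the function name, the 20*log10 dB convention, and the default of 256 points are assumptions made for the example.

```python
import cmath
import math


def vocal_tract_spectrum_db(a, n_points=256):
    """Negated log-magnitude DFT (in dB) of the series 1, a[1], ..., a[M], 0, ...

    `a` holds a[1..M]; the leading coefficient 1 is implicit. Only bins
    0 .. n_points/2 - 1 are returned, since the input series is real.
    Raises ValueError (from log10) if A(z) is exactly zero at some bin.
    """
    series = [1.0] + list(a)             # impulse response of the inverse filter
    spectrum = []
    for k in range(n_points // 2):
        acc = 0j
        for i, coeff in enumerate(series):   # the trailing zeroes add nothing
            acc += coeff * cmath.exp(-2j * math.pi * k * i / n_points)
        # Negating the inverse filter's log magnitude gives the tract response.
        spectrum.append(-20.0 * math.log10(abs(acc)))
    return spectrum
```

For example, the single coefficient a[1] = -0.9 gives A(z) = 1 - 0.9/z, whose magnitude at zero frequency is 0.1, so the first bin comes out at about +20 dB and is the largest value in the returned spectrum.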


The  Algorithm 

Autocorrelation. Given a frame of L digitized speech
samples, x[1] thru x[L], the first step in deriving an inverse filter
of M stages is to compute the autocorrelation vector R =
r[0], r[1], ..., r[M], where

$$r[n] = \sum_{i=1}^{L-n} x[i]\, x[i+n]$$


Markel claims [1] that much better results are obtained when
the input is multiplied by a non-rectangular window. Since our
own tests have not refuted this, and our autocorrelation method
permits windowing at low overhead, we precede the
autocorrelation by a Hamming windowing:

    x[i] := 0.5*(1-COS(2*PI*i/L)) * x[i],   1 ≤ i ≤ L


Matrix Inversion. The filter coefficients a[i] are obtained by
solving the system of linear equations R*A = B, where

    A transposed = [ a[1] a[2] ... a[M] ]

    B transposed = [ r[1] r[2] ... r[M] ]

and

$$
R = \begin{bmatrix}
r[0]   & r[1]   & r[2]   & \cdots & r[M-1] \\
r[1]   & r[0]   & r[1]   & \cdots & r[M-2] \\
r[2]   & r[1]   & r[0]   & \cdots & r[M-3] \\
r[3]   & r[2]   & r[1]   & \cdots & r[M-4] \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r[M-1] & r[M-2] & r[M-3] & \cdots & r[0]
\end{bmatrix}
$$

Since the "autocorrelation matrix" R is symmetric, positive
definite, and of Toeplitz form, it can be solved in k*M↑2 steps
rather than the generally needed k*M↑3, k a constant. Our
method is described under Implementation.

Spectrum. An N-point real DFT is applied to
1, a[1], ..., a[M] followed by N-M-1 zeroes. Since the a[i] are
real, the magnitude of the transform is symmetric about its center
and only the first N/2 frequency bins need be computed.


Optimum Values. Markel's experiments show [1] that for a
sample rate of 10 kHz, best results are obtained with M=14 and
200 ≤ L ≤ 300. We use M=14 and L=256. Our DFT has N=256 input
points but outputs 128 bins, for a frequency resolution of
about 40 Hz.


"Real-time" Defined. In this paper we define "real time"
capability as maintaining a frame rate of 100/sec or better.


Structure of the SPS-41


The SPS-41 is built by Signal Processing Systems, Inc. of
Waltham, Mass., costs about $30,500, and occupies the same
amount of rack space as a PDP-11/20 minicomputer.


Design  Philosophy 


The SPS-41 achieves high speed with modest hardware by
decomposing algorithms according to the sometimes-overlooked
fact that even a numerical analysis procedure spends only about
1/4 of its time computing on the data, with the rest divided about
equally between loop administration and memory stores and
fetches. Any concurrent I/O operations will further reduce the
fraction of time devoted to actual data processing.


Assigning parts of a computational task to multiple processors
according to their nature (calculation, loop indexing, or
I/O) achieves a form of parallelism quite distinct from either
the ILLIAC-IV approach or a pipelining of minis where each does
one phase of the Markel analysis. The first form is expensive and
possibly not suitable for linear prediction computation. The
second requires each mini to have powerful arithmetic units
(hardware multiply), but since each machine implements all aspects
of one phase, its multiplier is idle 3/4 of the time.


Putting  one  aspect  of  all  phases  of  the  total  process  on  each 
processor,  as  in  the  SPS-41,  permits  restricting  expensive 
multipliers  to  the  one  section  that  needs  them,  and  conversely 
having  no  loop-testing  facilities  in  the  arithmetic  section.  This  is 
just one example of the cost-saving specialization of processors
achievable  by  this  form  of  algorithm  decomposition. 


Individual Processor Characteristics


General. All sections deal with 16-bit 2's complement data
and have a 200-nsec instruction cycle time.


Arithmetic Section (AS). The AS contains three data
memories, a read-only sine/cosine table, four multipliers, six
summers (adders), and a 16x64 microcode store. The basic data
type is a complex word consisting of real and imaginary halves,
each 16 bits. However, the AS allows the two halves to be
treated separately as reals.


For a complex multiply, each of the four multipliers generates
one of the terms, and two summers built into the multiply section
form the real and imaginary outputs. Either the high or low 16
bits of the 32-bit product may be taken, but not both. The
high/low choice must be made when the multiply is done, not
afterwards. Getting both halves of a result for double precision
requires repeating the multiply with the same inputs. Products
may be scaled up or down a maximum of two bits; if the result
would overflow, it saturates to ±2↑15; saturation cannot be
disabled. Other modes besides complex may be requested, e.g.,
conjugate, matrix, and twin real.

There are also two complex summers (four real adders) under
direct microcode control, plus one complex accumulator whose
outputs can be scaled like the products.


To reduce the tendency for processing to be I/O-bound, the
AS has data memories to enable buffering, large-radix FFTs, etc.
AS data storage consists of two identical memories HI and LO, each
with 64 complex words, and the COEFF memory with 32. These
may all be read and written during one AS instruction. The only
ROM in the entire SPS-41 is TRIG, whose sines and cosines are
used for Hamming windows and Fourier transforms.


The  AS  logic  is  4-bit  byte  serial  and  requires  5 clocks  or  one 
usee  to  complete  an  instruction.  Thus  4M  real  multiplies  and  6M 
adds  per  second  can  be  achieved. 


The AS is a passive slave without even a program counter.
The microinstruction for each cycle is selected by the Index
Section, as are the scale factors and read/write memory
addresses.


Index Section (IS). The IS is the controller for the AS.
Rarely does any data (meaning speech-related data) pass through
it. The instruction set is oriented toward the byte shifts, bit
extraction, and rapid condition testing required for loop indexing
and control. It has a 32x16 general memory plus 32 16-bit
registers including 7 accumulators, 4 control interfaces to the AS,
and 15 trap registers.


The IS is a true computer with a program counter. There are
only 48 words of 32-bit program store, but these are augmented
by the Trap system, the most interesting feature of the IS. No
direct test, branch, or halt instructions exist. Instead, 4 bits of
every program word reference one of the 15 trap registers. (Trap 0
does not exist and signifies "no test.") Each user-loaded trap
register can hold 16 bits worth of tests and branch address for
the price of just 4 bits in the instruction. Furthermore, any
instruction can test and branch or halt on its results for free.
Thus both length and breadth of program store are conserved.


Input-Output Processor (IOP). The IOP interfaces the SPS-41
with the outside world and coordinates the 3 sections of the 41.
It is a universal device controller, where "device" includes the rest
of the SPS-41, anything attached directly to the 41's I/O bus, or
anything on the PDP-11's Unibus. The IOP can halt the IS-AS
pair and later continue them or re-initialize them and restart the
IS at any point in IS program store.


Up to 16 programs or "channels" can be timeshared by the
IOP on a fixed priority basis. At each 200-nsec clock, the cycle
goes to the highest-priority channel which is not waiting for an
unfulfilled external status condition, e.g., core memory fetch
complete. (Note that PDP-11 memory is treated as a peripheral
device, as suits its relative slowness.) Since there are 16 copies
of the program counter and the accumulator files, the equivalent
of an interrupt is serviced within 200 nsec with zero overhead.


The IOP has a 256x23 program store and a 256x16 data
memory, plus external registers for peripherals (including PDP-11
core) and four bidirectional data interfaces to the AS. Each
instruction can operate on two separate operands and put the
result in a third location, making the IOP a 3-address machine.
Separate instructions must be executed for tests and gotos or
subroutine calls, unlike the IS.


SPS-41 Implementation of Markel's LPC

The  System 

The SPS-41 is interfaced as a peripheral to a PDP-11/20
(now 11/40) mini. Other relevant peripherals are two magnetic
tape drives and a pair of author-constructed 12-bit digital to
analog converters (DACs). Currently the digitized speech data
must be imported from our PDP-10 via magnetic tape. The
PDP-11 reads successive frames of raw data from this tape into
core; the SPS-41 performs the Markel analysis on each frame and
writes its resonance spectrum into another core buffer; and the
PDP-11 writes this onto an output tape mounted on the second
tape drive. Each input frame and its resonance spectrum are also
displayed on an oscilloscope connected to the DACs; this
immediate viewing is helpful in evaluating the quality of the
SPS-41's computation. The output tape is then transferred to
the PDP-10 for more extensive reviewing. While this tape to
tape operation is hardly "real time," the system is capable of 100
frames/sec and will run in real-time mode once the required
analog to digital converter and clock have been installed on the
PDP-11 Unibus. Quality and accuracy are still the major goals, as
we are not yet fully satisfied.


Implementation  of  3 Phases 

The SPS-41 analyzes the signal in three phases: Hamming
window and autocorrelation; matrix solution for the filter
coefficients a[i]; and the log magnitude DFT of the a[i]. Since
the programs cannot all fit in the 41 (especially the IS), a small
swapper program residing in the IOP is called at the conclusion of
each phase to roll in the programs, constants, and data
initializations for the next phase. Since the IOP can access any
memory in the 41, it can load these from special core images
called overlays. The swapper is loaded by the PDP-11 while the
41 is in external (passive slave) mode.

Autocorrelation. The usual scheme for the short term
autocorrelation R of an input X of length L up to the Mth lag is:

    real array r[0:M], x[1:L]; real sum; integer n, i;

    for n:=0 step 1 until M do begin "lag_sums"
        sum:=0.;
        for i:=1 step 1 until L-n do sum:=sum+x[i]*x[i+n];
        r[n]:=sum;
    end "lag_sums";

(Note: "Real" and "Integer" are used here only to distinguish data
and indices, respectively. "Real" values are of necessity
fixed-point numbers in the SPS-41.)

This procedure reads most of the x[i] 2(M+1) times. The
accessing pattern is M+1 sweeps thru the array X. In an ordinary
computer this is no problem, but L=256 points will not fit
conveniently in the 41's AS data memories, and larger numbers
will not fit at all. Thus the 41 would make almost 2*(M+1)*L
reads from PDP-11 core with the above procedure, even though
there are only L unique values. Since core is to the 41 what
drums are to a conventional computer, the memory-boundedness
of the above scheme is intolerable.

Note that each x[i], M < i ≤ L-M, is involved in 2M+1 products: M
with x's of lower index, one with itself, and M with x's of higher
index. Using just enough AS memory to hold an x[i] and the
other x's involved with it, we can compute all of x[i]'s
contributions to the lag sums while x[i] is in the AS memory; thus
each x[i] need be fetched from core only once. In our case,
M=14 << L=256, giving a definite reduction in AS memory needs.


We use the 3 AS data memories as follows:

                LO          COEFF       HI

    Bottom      x[i]        x[i]        r[0]
                x[i+1]      unused      r[1]
                x[i+2]      unused      r[2]
                ...         ...         ...
    Top         x[i+M]      unused      r[M]

To compute the contribution of the pivot value x[i] in COEFF to
the partial sums r[0] thru r[M], multiply the pivot by the x-value
in each row of the LO buffer, and add this product to the r-sum in
the same row, i.e.,

    r[k]:=r[k] + x[i]*x[i+k],   0 ≤ k ≤ M

Now shift the LO buffer down one, discarding the x[i].
Copy the new buffer bottom x[i+1] into COEFF as the new pivot.
Fetch the next x-value x[i+M+1] and put it at the top of the
buffer. Then repeat the products and sums. Repeat the above
until x[L] has been fetched into the buffer. Continue from there
by fetching zeroes in place of the non-existent x[>L]'s, until x[L]
has been the pivot. Then stop. (The procedure is initialized by
filling the buffer with x[1] thru x[M+1] with x[1] as the pivot for
the first set of products.)

Note that the products of x[i] with itself and x's of higher
index are formed while x[i] is the pivot, and its products with
lower indices occurred previously as x[i] worked its way down
thru the buffer. Thus the autocorrelation can be computed using
just 2M+3 words of memory (including the r[i]) and fetching each
input x only once.
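
The pivot-and-buffer scheme can be checked against the conventional lag-sum loop with a short floating-point sketch. The function names are invented for the example, a `deque` stands in for the LO buffer (the SPS-41 program is described as moving pointers rather than shuffling data), and indexing is 0-based instead of the 1-based indexing used in the text. The window helper follows the weighting formula as printed earlier in the paper.

```python
import math
from collections import deque


def paper_window(L):
    """Weights 0.5*(1 - cos(2*pi*i/L)) for i = 1..L, as printed in the paper."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * i / L)) for i in range(1, L + 1)]


def autocorr_conventional(x, M):
    """Lag sums r[0..M] computed by M+1 sweeps over x."""
    L = len(x)
    return [sum(x[i] * x[i + n] for i in range(L - n)) for n in range(M + 1)]


def autocorr_single_pass(x, M, window=None):
    """Lag sums r[0..M], fetching each input sample exactly once.

    `buf` plays the role of the LO buffer (bottom = pivot, top = newest
    sample) and `r` the HI memory. If `window` is given, each sample is
    weighted once, on entry to the buffer.
    """
    L = len(x)
    fetched = 0

    def fetch():
        nonlocal fetched
        if fetched < L:
            value = x[fetched] * (window[fetched] if window else 1.0)
        else:
            value = 0.0            # zeroes stand in for samples past the end
        fetched += 1
        return value

    buf = deque(fetch() for _ in range(M + 1))    # initial fill
    r = [0.0] * (M + 1)
    for _ in range(L):             # every sample takes one turn as the pivot
        pivot = buf[0]
        for k in range(M + 1):
            r[k] += pivot * buf[k]
        buf.popleft()              # discard the old pivot
        buf.append(fetch())        # bring the next sample in at the top
    return r
```

For x = 1, 2, 3, 4, 5 and M = 2 both routines give the lag sums 55, 40, 26; the single-pass version holds only M+1 samples at any moment.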

Also note that M+1 multiplications and additions take place
between x-fetches, so the data rate of the input storage medium
may be lower by a factor of at least 2M, compared to the
conventional method. This is important in any system using
two-level, cache, or virtual storage. The single fetching of the
x[i] in increasing order not only assures optimum efficiency under
paging systems, but also suggests a real-time autocorrelation
scheme in which each data point's contribution to the running lag
sums is computed as soon as it comes from the outside world. By
doing r[k]:=C*r[k]+x[i]*x[i+k], where C is almost 1.0, earlier
contributions to the running sums will exponentially decay and
real-time displays of long-term continuous signals could be
produced.
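
A minimal sketch of such a decaying running autocorrelation follows. It is not part of the SPS-41 program; the class name and the default decay value are hypothetical. For simplicity it pairs each arriving sample with the M samples that preceded it, rather than waiting for a pivot's M successors; the same products are formed either way.

```python
from collections import deque


class RunningAutocorrelation:
    """Lag sums r[0..M] that decay by a factor `decay` per input sample."""

    def __init__(self, M, decay=0.99):
        self.decay = decay
        # history[k] is the sample that arrived k steps ago (0 = newest).
        self.history = deque([0.0] * (M + 1), maxlen=M + 1)
        self.r = [0.0] * (M + 1)

    def push(self, sample):
        """Fold one new sample into the running sums and return them."""
        self.history.appendleft(sample)      # the oldest sample drops off
        for k in range(len(self.r)):
            self.r[k] = self.decay * self.r[k] + sample * self.history[k]
        return self.r
```

With decay set to exactly 1.0 nothing is forgotten, and after the last sample of a frame the sums equal the ordinary lag sums, which gives a simple check of the bookkeeping.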

Important here is the elimination of a core-to-core Hamming
windowing. Since each x[i] is fetched but once, it is multiplied by
the appropriate point on the Hamming weighting upon entry to the
AS, before being placed on top of the buffer. Windowing
contributes only 7% overhead to our SPS-41 autocorrelation. Our
program does not shuffle the buffer for each new input, but
instead the IS maintains pointers to the buffer's top and bottom,
which crawl around the LO memory.

The 13-bit speech input values are regarded as ranging from
-1.0 to almost +1.0, thus are scaled x2↑-13. The products are
left at x2↑-11 scale, allowing for sums from -16.0 to almost +16.


Matrix Solution. Procedures for inverting this form of
Toeplitz matrix date back to Levinson [2] and Robinson [3], were
adapted by Markel [1], and later simplified by Markel and Gray [4]
who eliminated 3 of the 7 steps as redundant. This redundancy
was also discovered by the author. All versions of the algorithm
are iterative, and after the nth iteration an nth-order inverse filter
has been designed.

The algorithm implemented on the SPS-41 is:

    real array R[0:M],  comment input from autocorrelation;
               A[0:M],  comment output filter coefficients;
               TA[1:M]; comment temporary storage;
    real alpha, beta, C; integer n, i;
    comment Initialize;
    A[0]:=1.0; alpha:=R[0];
    for i:=1 step 1 until M do begin
        A[i]:=0.; TA[i]:=0.;
    end;

    for n:=1 step 1 until M do begin "Iterations"
        beta:=0;
        for i:=0 step 1 until n-1 do beta:=beta+A[i]*R[n-i];
        C:=-beta/alpha;
        for i:=1 step 1 until n do TA[i]:=A[i] + C*A[n-i];
        for i:=1 step 1 until n do A[i]:=TA[i];
        alpha:=alpha + C*beta;
    end "Iterations";

All arithmetic is done by the AS except the divide, which is
programmed in the IOP by the usual minicomputer techniques.
For M=14, the entire procedure takes less than one msec, of
which divisions account for half. Scaling is x2↑-13, allowing
values from -4.0 to almost +4.
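
As a cross-check on the iteration, here is a floating-point sketch of the same recursion. It assumes the update step reads TA[i] := A[i] + C*A[n-i], uses ordinary floats rather than the x2↑-13 fixed-point scale, and the function name is invented for the example. The returned list of C values is an addition that makes it easy to look for iterations where |C| reaches 1.

```python
def inverse_filter(R, M):
    """Return (A, Cs): A[0..M] with A[0] = 1, and the C value of each iteration.

    R[0..M] are the autocorrelation lag sums; R[0] must be nonzero.
    """
    A = [1.0] + [0.0] * M
    alpha = R[0]
    Cs = []
    for n in range(1, M + 1):
        beta = sum(A[i] * R[n - i] for i in range(n))
        C = -beta / alpha
        Cs.append(C)
        # Build the new coefficients from the old ones before overwriting them.
        TA = [A[i] + C * A[n - i] for i in range(1, n + 1)]
        A[1:n + 1] = TA
        alpha += C * beta            # equivalent to alpha * (1 - C*C)
    return A, Cs
```

For lag sums 1, 0.5, 0.25 the sketch returns A = 1, -0.5, 0. In general the result can be verified by checking that the sum over i = 0..M of A[i]*R[|j-i|] vanishes for every j from 1 to M, which is the condition this iteration enforces with A[0] fixed at 1.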

Log-Mag DFT. The AS computes the complex Fourier
transforms one at a time for each frequency from 0 to 5000 Hz in
steps of about 40 Hz. Using complex conjugate multiply, the AS
then finds the squared magnitude and passes both halves of the
32-bit product to the IOP. The IOP locates the most significant
bit and encodes its position as the 5-bit exponent of the base-2
log, and the top 7 bits of the normalized value are used to index a
128-word table of logs from 1.0 to almost 2.0 (currently kept in
PDP-11 core) to fill in the 8-bit mantissa. After rounding the
13-bit log to 12 bits, the IOP writes it into the core output buffer
and is ready for the next value from the AS.
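
The exponent-plus-table logarithm can be sketched as below. The table contents and rounding are assumptions (the paper does not list its table), the function name is hypothetical, and the final rounding from 13 to 12 bits is left out.

```python
import math

# 128-entry table: 8-bit fraction of log2(v) for v = 1 + j/128 in [1.0, 2.0).
LOG_TABLE = [min(255, round(256 * math.log2(1 + j / 128))) for j in range(128)]


def log2_fixed(value):
    """Approximate log2 of a positive 32-bit integer as a 13-bit number:
    a 5-bit exponent (MSB position) followed by an 8-bit table fraction."""
    if not 0 < value < (1 << 32):
        raise ValueError("expected a positive 32-bit value")
    exponent = value.bit_length() - 1        # position of the most significant bit
    normalized = value << (31 - exponent)    # leading 1 moved up to bit 31
    index = (normalized >> 24) & 0x7F        # the 7 bits just below the leading 1
    return (exponent << 8) | LOG_TABLE[index]
```

Dividing the result by 256 gives log2 of the input to within about 0.013, most of which comes from using only 7 bits of the normalized value as the table index.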

Since the AS computes a 256-point DFT on only 15 nonzero
points, and only the first half of the results are unique, and the
IOP logarithm procedure can handle only one value at a time, it is
just as practical for the AS to compute the Fourier transform by
direct integration, rather than by any "fast" FFT techniques. This
phase takes about 2 msec. While a pruned FFT could do its part
faster, the log-mag part would be slowed down; thus we do not
intend to switch to an FFT.


Results  and  Conclusions 

Results. The system has been fully operational since
January 1974. However, it tends to give obviously incorrect
results on strong voiced segments (vowels). Sibilants never fail.
Early problems with saturated products in computing Beta in
matrix inversion and the Fourier sums were solved by scaling
down an extra bit. Disabling the AS saturation logic, if possible,
would also have solved these problems.

Conclusions. The remaining problems lie in the matrix
inversion. Markel and Gray [4] show that if |C| ≥ 1.0 at any
iteration, the inversion has failed due to numerical errors.
PDP-10 simulations of the fixed-point arithmetic reveal several
excessive C-values during those speech frames for which faulty
SPS-41 output is observed. Excessive C's usually result from
relatively small values of Alpha and Beta; this suggests too much
loss of significance in the x2↑-13 scaling. However, the current
scale's range of -4. to +4. is often needed, so the conclusion
seems to be that more bits/word are needed. Markel and Gray
[4] state that 23 bits are required. Thus future efforts will
probably be devoted to converting some of the operations to
double precision; the question is, which operations can be safely
left in single?

Note that taking the high half of a product amounts to
truncation, without roundoff. We have altered the matrix
inversion AS program to achieve rounding at no extra cost as
follows: Appropriate constants are kept in the imaginary halves
of all data words, such that their contribution to a complex
conjugate multiply is one-half the value of the LSB of the high
product. The contribution of this rounding trick has not been
fully tested.

Could the autocorrelation phase be partly responsible?
Simulations show no overflows or saturations. However, since
each r[i] is the sum of 256 truncated products, each r[i] is low by
a random variable distributed equally from 0 to 255, with a mean
error of 128. It would probably help to add 128 to every final
sum. However, such errors should matter most for the small sums
computed on fricatives, which our system handles well, and matter
least on loud vowels, when we fail! We must still conclude that
matrix inversion is the weak link, although better accuracy in
distinguishing weak fricatives would no doubt result from double
precision autocorrelation.
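
The size of the truncation bias, and the effect of adding half an LSB before taking the high half, can be illustrated with plain integers. This is a sketch of the arithmetic only; it does not model how the AS injects the rounding constant through the imaginary halves, and the function names are invented for the example.

```python
def high_half_truncated(product):
    """High 16 bits of a 32-bit product; the low half is simply dropped."""
    return product >> 16


def high_half_rounded(product):
    """High 16 bits after adding half of the retained LSB (1 << 15)."""
    return (product + (1 << 15)) >> 16


def mean_error_lsb(high_half, products):
    """Average of (exact - retained), in units of the retained LSB."""
    return sum(p / 65536 - high_half(p) for p in products) / len(products)
```

Averaged over all 65536 possible low halves, truncation loses about 0.5 LSB per product while the rounded version loses essentially nothing. Summing 256 truncated products would therefore be expected to come out low by roughly 256 * 0.5 = 128 LSBs.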

Differentiation of the input by the IOP will be tried soon; not
only will this lower the failure rate of matrix inversions (by
reducing the low-frequency energy which contributes to |C| ≥ 1.0),
but it will also compensate the overall -6 dB/octave vocal tract
characteristic and show the 2nd and 3rd formants better [4].

Future Work. Top priority naturally goes to improving the
accuracy of the system, as by using double precision on the
matrix and perhaps the autocorrelation. The latter can be
speeded up by a factor of at least 2 and probably 4; presently
autocorrelation uses only one of the four AS multipliers.
Reprogramming will reduce the Phase 1 time from 5 to 1.5 msec.
True real-time operation will occur when the system becomes part
of the Hearsay II implementation on C.mmp [5]. Even though
double precision may be required throughout, we still believe that
real-time estimation of vocal-tract resonances is possible on the
SPS-41.


ABSTRACT

An A-D and D-A converter system with exceptionally wide dynamic range and low
distortion is discussed. The converters include a special track and hold circuit which
eliminates slewing distortion, active low pass filters, and a data buffering queue.


Introduction 

Traditional 12-bit analog-digital-analog conversion of high
quality audio is becoming insufficient for audio analysis and
synthesis research. The need for greater dynamic range and low
distortion has led to the development of a 16-bit converter
system at Carnegie-Mellon University designed specifically for
audio service. The system has a total dynamic range of 90 dB,
and less than 0.1 percent distortion and noise at large signal
amplitudes. Conversion periods from 20 microseconds to 150
microseconds are programmable and an appropriate low pass filter
is selected automatically. Direct memory access to a
minicomputer and a 64 word data queue provide simplified
programming.

Conversion  Technique 

Figures 1 and 2 show schematically the operation of the DAC
and ADC respectively. Rather than use full 16-bit converters,
the system first prescales the 16-bit digital (or analog) signal
to form a quasi-floating-point number. Twelve bits beginning
with the first significant bit are taken as a floating-point
fraction while a 3-bit "exponent" signifies the position, or
magnitude, of the "fraction". Only the 12-bit "fraction" is
converted and afterward the analog (or digital) signal is
postscaled by the "exponent" to restore proper magnitude. This
technique extends the dynamic range of 12-bit conversion by 24 dB
without incurring the expense and stability problems of true
16-bit converters. As with conventional designs, track and hold
circuits are employed on the DAC to deglitch the converter, and
on the ADC to permit successive-approximation conversion.
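
The prescale and postscale steps can be sketched in a few lines.
This is a minimal model, not the hardware design: it assumes signed
two's-complement samples, an arithmetic right shift that discards
low-order bits, and an exponent equal to the shift count (0 to 4,
which fits in the 3-bit field). The function names are hypothetical.

```python
def prescale(sample, frac_bits=12, word_bits=16):
    """Split a signed 16-bit sample into a 12-bit fraction and an exponent."""
    word_lo, word_hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    if not word_lo <= sample <= word_hi:
        raise ValueError("sample does not fit in the input word")
    frac_lo, frac_hi = -(1 << (frac_bits - 1)), (1 << (frac_bits - 1)) - 1
    exponent = 0
    fraction = sample
    # Shift right until the value fits in the 12-bit converter range.
    while not frac_lo <= fraction <= frac_hi:
        exponent += 1
        fraction = sample >> exponent  # low-order bits are truncated
    return fraction, exponent

def postscale(fraction, exponent):
    """Restore the original magnitude after the 12-bit conversion."""
    return fraction << exponent

print(prescale(1000))                  # small signal: no bits lost
print(prescale(32767))                 # full scale: four bits shifted out
print(postscale(*prescale(32767)))     # reconstructed value
```

In this model a full-scale sample is shifted by four bit positions,
and four bits correspond to roughly 4 x 6 = 24 dB, the extension of
dynamic range quoted above. The price is that the reconstructed
value can fall below the original by up to one step of the shifted
fraction.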

Track  and  Hold 

It is not generally recognized that the DAC track and hold
can create considerable distortion. The usual transition
behavior of commercial track and hold circuits consists of a slew
period followed by quick and exact settling to the new signal
level. Because this transition slewing is not superposition
linear, heterodyning effects between the input signal and the
sampling clock may occur. For example, two microseconds is a
typical slewing time for a full scale transition. If a maximum
amplitude sinusoidal input signal of 7 KHz is sampled at a rate
of 20 KHz, a 1 KHz heterodyne of approximately -35 dB amplitude
will be produced. In this case the input signal is sampled roughly
three times per cycle, and the resulting slewing asymmetries repeat
every seven cycles.
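
The 1 KHz figure follows from simple arithmetic on the two
frequencies: the sampling pattern repeats at their greatest common
divisor. The sketch below only illustrates that arithmetic; the
function name is hypothetical and integer frequencies in Hz are
assumed.

```python
from math import gcd

def repeat_pattern(signal_hz, sample_hz):
    """Return (pattern rate in Hz, signal cycles per repeat, samples per repeat)."""
    pattern_hz = gcd(signal_hz, sample_hz)
    return pattern_hz, signal_hz // pattern_hz, sample_hz // pattern_hz

print(repeat_pattern(7000, 20000))  # (1000, 7, 20)
```

Seven cycles of the 7 KHz input span exactly 20 samples, so any
asymmetry in the slewing recurs 1000 times per second.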

Changing the track and hold transition behavior to a simple
exponential decay results in non-slewing transitions which
maintain superposition linearity. Although long settling time
makes exponential decay useless for most commercial applications,
audio signals incur only slight changes of amplitude and phase.
The track and hold designed for the 16-bit converters has an
exponential time constant of about 0.5 microsecond and the
resulting slight high frequency roll-off can be compensated by an
external network.
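
The difference between the two transition behaviors can be shown
with an idealized model. The sketch below assumes a step starting
from zero, a 0.5 microsecond time constant for the exponential case,
and a hypothetical slew rate of one unit per microsecond for the
slewing case; it is not a model of any particular circuit.

```python
import math

def exponential_step(step, t, tau=0.5):
    """Output t microseconds after a step, with exponential settling."""
    return step * (1.0 - math.exp(-t / tau))

def slewed_step(step, t, rate=1.0):
    """Output t microseconds after a step, slewing at a fixed rate
    (units per microsecond) and then settling exactly."""
    return math.copysign(min(abs(step), rate * t), step)

t = 0.5  # microseconds after the transition begins
# Exponential settling: the response to 1 + 2 equals the sum of responses.
print(exponential_step(3, t), exponential_step(1, t) + exponential_step(2, t))
# Slewing: the response to 1 + 2 does not equal the sum of responses.
print(slewed_step(3, t), slewed_step(1, t) + slewed_step(2, t))
```

Because the exponential response scales with the step size, the
transition error is a linear function of the signal and produces no
new frequencies. The slewed response does not scale, which is the
source of the heterodynes.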

As a further modification for audio service, overall DC
feedback may be used around the track and hold. Since the audio
signal is sampled linearly and no heterodynes (including DC) are
formed, the output can be integrated and fed back to suppress any
DC errors. The complete D-A system diagram in figure 3 shows
that the DC feedback loop includes all amplifiers to the output
connector where an offset of less than one-half LSB can be easily
maintained. The complete A-D system pictured in figure 4 uses a
digital integrator and a small DAC to maintain zero digital
offset in the output data.
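
The A-D offset loop can be sketched as a digital integrator whose
output is subtracted from the incoming data. This is a minimal
model under stated assumptions: the loop gain of 0.05 is
hypothetical, the small DAC is treated as ideal, and the
subtraction is done numerically rather than in the analog path.

```python
def remove_dc(samples, gain=0.05):
    """Suppress DC offset with an integrator in a feedback loop."""
    accumulator = 0.0
    output = []
    for x in samples:
        y = x - gain * accumulator  # correction supplied by the small DAC
        accumulator += y            # digital integrator of the output data
        output.append(y)
    return output

corrected = remove_dc([100.0] * 500)
print(corrected[0], corrected[-1])
```

For a constant input the output decays geometrically toward zero,
so any steady offset is eventually removed. A smaller gain would
slow the correction and leave low audio frequencies less affected.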

Low Pass Filters

Any audio conversion system clearly must have low pass
filters commensurate with system quality. The frequency-
dependent negative resistance (FDNR) active filter configuration
permits the design of component tolerant filters with very low
distortion and wide dynamic range [1]. Because varying audio
requirements make the optimization of filter parameters
difficult, the filters were built as easily modifiable modules.
Any standard-configuration ladder filter of order nine or less
may be implemented by changing a few resistors. Standard ninth
order elliptic-function values (to 1 percent tolerance) presently
are being used. Passband equals 87 percent of the Nyquist
frequency with 0.5 dB measured ripple. The stopband attenuation
at the Nyquist frequency and above measures greater than 68 dB,
and signal to noise ratio (20 KHz bandwidth) exceeds 95 dB.

Several conversion rates are commonly required by the user
community. To facilitate ease of operation, four low pass
filters with differing cut-off frequencies are installed in both
the A-D and D-A. A buffer amplifier is necessary at the output
of each filter and includes a peaking network to compensate high
frequency roll-off phenomena (including track and hold) which are
a function of conversion rate [2]. A peak of about 6 dB is
required, and the networks provide a few dB of additional
stopband attenuation.

System Features

The systems are designed for convenient user operation.
Figures 3 and 4 show that the A-D and D-A are independently
interfaced to a PDP-11 minicomputer. Data, usually divided into
large blocks, is transferred by direct memory access (DMA).
Processor attention is not required except for interrupt service
at the completion of each block transfer. During these
interrupts, a 64 word first-in-first-out (FIFO) queue provides
several milliseconds of buffering to permit continuous data flow
without critical interrupt timing.
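
The buffering time follows directly from the queue depth and the
conversion period. The sketch below works this out for the shortest
and longest periods quoted in the Introduction; the function name
is hypothetical.

```python
def fifo_buffer_ms(period_us, depth_words=64):
    """Time in milliseconds that a full FIFO covers at one word per conversion."""
    return depth_words * period_us / 1000.0

print(fifo_buffer_ms(20))   # 1.28
print(fifo_buffer_ms(150))  # 9.6
```

The queue therefore allows between roughly 1.3 and 9.6 milliseconds
of interrupt latency, depending on the selected conversion rate.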

A crystal clock divider provides four program selectable
conversion clocks between 20 microseconds and 150 microseconds.
Programming the clock rate simultaneously connects the appropriate
low pass filter from the set of four filters. Timeout circuitry
clears the converters and FIFOs between user operations to
eliminate annoying clicks at the startup and conclusion of
conversion.
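
As a worked example of how the clock selection relates to the
filters, the sketch below converts a conversion period into a
sampling rate, Nyquist frequency, and passband edge, using the 87
percent figure given under Low Pass Filters. Only the two extreme
periods are stated in the text, so only those are evaluated; the
function name is hypothetical.

```python
def filter_edges_hz(period_us, passband_fraction=0.87):
    """Return (sampling rate, Nyquist frequency, passband edge) in Hz."""
    sample_rate = 1e6 / period_us
    nyquist = sample_rate / 2
    return sample_rate, nyquist, passband_fraction * nyquist

for period in (20, 150):
    rate, nyquist, edge = filter_edges_hz(period)
    print(period, round(rate), round(nyquist), round(edge))
```

At the 20 microsecond period the passband extends to about 21.75
KHz, and at 150 microseconds to about 2.9 KHz.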

For monitoring purposes, the D-A has provision to echo the
A-D output independently of the processor. All of the D-A inputs
including clock and filter selection automatically switch to echo
mode for the duration of A-D operation.

Performance  Tests 

The D-A system was tested by converting perfect digital
sinewaves of varying amplitude and frequency. The fundamental
sinewave was removed from the analog output of the converter
system with a compensated twin-tee filter and the resulting
residue (all noise, harmonic distortion, and heterodynes) is
plotted in figure 5. For low amplitude signals, the random noise
of the active filter and the DAC quantization noise are about 3
dB above the theoretical minimum quantization noise of the
conversion. As the peak sinewave amplitude is increased above
twelve bits, conversion noise rises because of the floating-point
operation of the converter which truncates low order bits. In
this region total residue is about 0.03 percent, and harmonic
distortion becomes noticeable only near maximum amplitudes. The
increase in residue at 0 dB, 12 KHz input is a heterodyne caused
by slight distortion in the active low pass filter.
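
As a rough check on the 0.03 percent figure, the standard formula
for an ideal uniform quantizer driven by a full-scale sinewave can
be applied to the 12-bit fraction. This is an order-of-magnitude
sketch only: it assumes uniformly distributed quantization error
and ignores filter noise and the changing exponent. The function
names are hypothetical.

```python
import math

def ideal_sine_snr_db(bits):
    """SNR of an ideal quantizer for a full-scale sine (about 6.02 * bits + 1.76 dB)."""
    return 20 * math.log10(2 ** bits * math.sqrt(1.5))

def residue_percent(snr_db):
    """Express a noise level snr_db below the signal as a percentage of it."""
    return 100 * 10 ** (-snr_db / 20)

print(round(ideal_sine_snr_db(12), 1))                   # about 74 dB
print(round(residue_percent(ideal_sine_snr_db(12)), 3))  # about 0.02 percent
```

An ideal 12-bit fraction would leave a residue near 0.02 percent,
so a measured total of about 0.03 percent is within a few dB of that
bound.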


The A-D system test was somewhat cumbersome but provides
preliminary information until a thorough test can be implemented
[3]. Similar to the D-A test, a low distortion sinewave (noise
and distortion more than 80 dB down) was converted and the
fundamental subtracted (digitally) from the output. The residue
was then digitally amplified and reconverted to analog for
examination. Figure 6 shows a noise shelf of about 90 dB: about
8 dB above the theoretical minimum. This higher level is
partially attributable to track and hold sampling of high
frequency noise from the filters. In general, quantizing noise
at all input amplitudes is higher because of sensitivity of the
analog circuits driving the 12-bit ADC. Both of these factors
hopefully can be reduced in the near future.
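
One way to subtract a fundamental digitally is to fit a sine and
cosine at the known test frequency and remove the fitted component.
The sketch below shows that approach; it is not necessarily the
procedure used for the test above. It assumes the record holds a
whole number of cycles of the test tone, and the function name and
the 12-bit rounded test signal are hypothetical.

```python
import math

def residue_db(samples, cycles):
    """Remove the fundamental that completes `cycles` whole periods in the
    record and return the residue level in dB relative to the fundamental."""
    n = len(samples)
    w = 2 * math.pi * cycles / n
    sines = [math.sin(w * k) for k in range(n)]
    cosines = [math.cos(w * k) for k in range(n)]
    # Correlate against sine and cosine to estimate the fundamental.
    a = 2 / n * sum(x * s for x, s in zip(samples, sines))
    b = 2 / n * sum(x * c for x, c in zip(samples, cosines))
    residue = [x - a * s - b * c for x, s, c in zip(samples, sines, cosines)]
    fundamental_rms = math.sqrt((a * a + b * b) / 2)
    residue_rms = math.sqrt(sum(r * r for r in residue) / n)
    return 20 * math.log10(residue_rms / fundamental_rms)

n = 1000
test_tone = [round(2047 * math.sin(2 * math.pi * 7 * k / n)) for k in range(n)]
print(residue_db(test_tone, 7))
```

Everything that is not the fundamental, including noise, harmonics,
heterodynes, and any DC offset, remains in the residue and is
counted in the reported level.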

Conclusion 

A 16-bit A-D-A conversion system has been designed
specifically for high fidelity audio service. The system
utilizes floating-point approximation at conversion, a linear
track and hold circuit, and overall DC feedback. Major user
features include easily modified low pass filters, a DMA
minicomputer interface, and a 64 word data buffer. System
performance approaches theoretical limits.

Figure 1. Operation of the 16-bit DAC


Figure 2. Operation of the 16-bit ADC

Figure 4. The complete A-D system


